Common protein surface shapes and uses therefor

ABSTRACT

A method of determining common three-dimensional structural features of protein surfaces is provided, as is use of representations of these common structures in molecular database searching and in designing focussed molecular libraries. The method is particularly concerned with the analysis and representation of protein surfaces such as β-turns, loops and contact surfaces. In one form, the method identifies common locations and orientations of amino acid side-chains, simplified as Cα-Cβ vectors. In another form, the method identifies common regions of surface charge represented by grid points in three-dimensional space. Further provided are common three-dimensional structural features of proteins that can be used to search molecular databases for the purposes of identifying molecules that match these common three-dimensional structural features. The common three-dimensional structural features can also be used to focus de novo molecular generation to produce libraries containing molecules that have these common three-dimensional structural features. Libraries of these structurally-related molecules may then be produced for the purposes of drug discovery.

FIELD OF THE INVENTION

THIS INVENTION relates to a method of determining common three-dimensional structural features of proteins and use of representations of these common structures in molecular database searching, in molecular engineering and in designing focussed molecular libraries. More particularly, this invention relates to the identification and representation of protein surfaces such as β-turns, loops and contact surfaces and the determination of grid points describing surface charge, or the determination of common locations and orientations of amino acid side-chains, simplified as Cα-Cβ vectors. These protein surfaces are typically involved in interactions with other molecules such as other proteins, nucleic acids, metal ions, antigens, drugs and toxins although without limitation thereto. This invention therefore provides common three-dimensional structural features that can be used to search molecular databases for the purposes of identifying molecules that match these common three-dimensional structural features. The common three-dimensional structural features can also be used to engineer de novo molecules or molecular libraries that have one or more common three-dimensional structural features. Molecules and molecular libraries may be useful for the purposes of drug discovery.

BACKGROUND OF THE INVENTION

The chemical diversity possible amongst the suspected 10^(180) possible drug-like molecules is immense, and a given combinatorial library can only hope to capture a tiny fraction of this diversity space. Molecular library design strategies use chemoinformatic techniques to select a diverse set of molecules for library synthesis. The molecular selection process involves the calculation of the chemical characteristics of each member of the library, using hundreds of chemical descriptors. It is therefore possible to derive a “diverse” library, where the molecules differ from each other as much as possible in descriptor space, or a “focussed” library, where the molecules are similar in descriptor space to a known active. With hundreds of potential descriptors it is difficult to know which descriptors are important or essential for describing biological activity. These approaches consequently optimise libraries in the chemical universe but do not identify molecules that could modulate biological function.

It is becoming evident that the synthesis of large combinatorial libraries makes sense only if guided by sound library design principles. It is generally accepted that focussing libraries can lead to a 10-100 fold increase in the discovery of “hits” (i.e. candidate or lead molecules).

A significant number of pharmaceutical targets involve the mimicking or inhibition of protein interactions with other molecules. With the rapid advance of the human genome project, it is likely that many more protein interaction targets will be identified.

Proteins are amino acid polymers that fold into a globular structure. This globular structure, in general, has a hydrophobic interior. The structure of proteins is defined by the polymeric nature of the backbone and includes secondary structure elements such as helices, sheets, loops and turns. Whilst the description of protein structure by the nature of its polymeric backbone (its “skeleton”) is useful for comparing one protein to another, it is not useful when describing the structural elements of various molecular recognition events of proteins. This is because molecular recognition is a surface phenomenon and proteins use large flat surface areas ranging from 1150 to 4660 Å², comprising on average 211 atoms from 52 amino acid residues. These binding surfaces may be continuous (such as β-turns and loops), or discontinuous surfaces that comprise 1-11 segments (where a segment is separated by at least 5 amino acid residues and can be from a different secondary structure) with an average of 5 segments per interface.

OBJECT OF THE INVENTION

The present inventors have realized that by creating focussed libraries of compounds that mimic common structural elements of protein surfaces, the likelihood of that library containing a molecule that mimics or inhibits a protein-molecule interaction will be enhanced.

It is therefore an object of the invention to identify common elements of protein surfaces.

It is also an object of the invention to provide a method to identify or to de novo engineer one or more molecules that match common elements of protein surfaces.

SUMMARY OF THE INVENTION

The present invention is therefore broadly directed to the identification of common protein surface elements as descriptors of protein surfaces for use in molecular design, engineering and screening.

In a first aspect, the invention provides a method of producing a description of a common three-dimensional protein surface shape including the steps of:

-   (i) identifying a three-dimensional surface shape of each of a plurality of proteins; and
-   (ii) creating one or more descriptors wherein each said descriptor represents a common surface shape of two or more proteins of said plurality of proteins.

In one embodiment, the three-dimensional surface shape is identified as respective amino acid side-chain locations and orientations of two or more amino acids of each said protein.

According to this embodiment, at step (ii) each said descriptor represents a common location and orientation of the respective amino acid side chains.

Preferably, each amino acid side chain used to produce said descriptor is simplified as a C_(α)-C_(β) vector.

In another embodiment, the three-dimensional surface shape is identified as a surface charge distribution of each said protein.

According to this embodiment, at step (ii) each said descriptor represents a common charged surface region of two or more proteins of said plurality of proteins.

Preferably, each charged surface region is represented by at least four grid points.

According to the invention, said two or more amino acids form at least part of a structural feature of each of said two or more proteins.

Preferably, said structural feature is, or comprises, a β-turn, a loop or a contact surface.

In a second aspect, the invention provides a method of identifying one or more molecules having a common three-dimensional protein surface shape, said method including the steps of:

-   (i) creating a query using one or more descriptors that each represent a common three-dimensional protein surface shape; and
-   (ii) using said query to search a database and thereby identify one or more entries in said database that correspond to one or more molecules that each match said descriptor.

In one embodiment, at step (i), the descriptor represents a common amino acid side-chain location and orientation of two or more amino acids of each of two or more proteins.

In another embodiment, at step (i), the descriptor represents a common protein surface charge shape of two or more proteins.

In yet another embodiment, the query comprises:

-   (a) a descriptor that represents a common amino acid side-chain location and orientation of two or more amino acids of each of two or more proteins; and
-   (b) a descriptor that represents a common protein surface charge shape of said two or more proteins.

Suitably, according to the second aspect, said query is used to search a computer-searchable database comprising a plurality of entries.

Preferably, each amino acid side chain used to produce said descriptor is simplified as a Cα-Cβ vector.

Preferably, for the purposes of database searching, Cα-Cβ vectors and/or surface charge grid points are represented as a distance matrix.

In a third aspect, the invention provides a method of creating a library of molecules including the steps of:

-   (i) searching a database to identify one or more entries corresponding to one or more molecules that each match a common protein surface shape; and
-   (ii) using at least one of the one or more molecules identified at step (i) to create a library of molecules.

In a particular embodiment, this third-mentioned aspect includes the step of creating a library of molecules from the one or more molecules identified at step (ii).

Said library of molecules may be a “virtual” library or a synthetic chemical library.

In a fourth aspect, the invention provides a method of engineering one or more molecules including the steps of:

-   (i) creating one or more descriptors that each represent a common three-dimensional protein surface shape; and
-   (ii) engineering one or more molecules that respectively comprise one or more structural features according to the or each descriptor in (i).

Throughout this specification, unless otherwise indicated, “comprise”, “comprises” and “comprising” are used inclusively rather than exclusively, so that a stated integer or group of integers may include one or more other non-stated integers or groups of integers.

BRIEF DESCRIPTION OF THE FIGURES AND TABLES

FIG. 1. Distribution of c_(α1)-c_(α4) distances of all four-residue segments that are neither helical nor β-sheet and that are found in high-resolution, non-homologous structures in the Protein Data Bank³¹.

FIG. 2. a) Each β-turn is represented by four C_(α)-C_(β) vectors highlighted by the dark triangle. b) To aid visualization of the spatial arrangement of the turn after clustering, the four torsional angles θ1, θ2, θ3 and θ4 are used as an approximation to the 24 distances. c) The four torsional angles are plotted as a vector from (θ1, θ2) (represented by the symbol ‘x’) to (θ3, θ4).

FIG. 3. Vector plot of the seven clusters obtained from the k^(th) nearest neighbor cluster and the filtered nearest centroid sorting algorithms. A threshold of 0.65 RMSD was used.

FIG. 4. Vector plots of the eight clusters formed from the clustering algorithm and explicit division of cluster three into two clusters.

FIG. 5. Vector plots of the nine clusters formed from the clustering algorithm and explicit division of cluster three into two clusters and inclusion of the average structure of type I′ in the initial seed. The last graph represents the conformations that were rejected and the first nine graphs represent the nine clusters.

FIG. 6. The β-turns within each of the nine clusters were superimposed onto the cluster's mean structure.

FIG. 7. Top view of the β-turn structures in cluster two superimposed onto its mean structure. The figure shows that the backbone structures can vary significantly even though the c_(α)-c_(β) vectors are distributed uniformly.

FIG. 8. Superimposition of the mean structures of the nine clusters. The superimposition is based on the three atoms c_(α1), c_(α2) and c_(α3).

FIG. 9. Vector plots of the β-turns in each of the nine types of β-turns defined by Hutchinson and Thornton²⁵. The order of the plots is: type I, II, I′, IV, II′, VIa1, VIa2, VIII and VIb.

FIG. 10. Number of neighboring loops versus frequency for various values of NEIGHBOR_LIMIT.

FIG. 11. Plot of (1) number of peaks, (2) number of peaks with greater than twenty neighbors and (3) number of unique peaks with greater than twenty neighbors as a function of NEIGHBOR_LIMIT.

FIG. 12. The filtered centroid sorting algorithm was used with various values of TOLERANCE to obtain the 39 clusters. The percentage representation, the intracluster RMSD, the intercluster RMSD and the ratio of the latter two are calculated and plotted as a function of TOLERANCE.

FIG. 13. The loops are assigned to one of the 39 seeds using various values of TOLERANCE in the filtered centroid sorting algorithm. The resulting loops in cluster one are superimposed and displayed in stereo view to show the effect of the choice of the tolerance value. Each line connects the position of the c_(α) atom in white and the position of the c_(β) atom in gray. TOLERANCE = A) 0.3, B) 0.5, C) 0.7 and D) 0.9.

FIG. 14. Histogram of the number of loops in each of the 39 clusters.

FIG. 15. Vector plots of the 39 clusters, with cluster number starting from one and counting across a row before going to the next row. The last cluster, cluster 40, represents all the loops that have not been clustered according to our filtered centroid sorting algorithm.

FIG. 16. Tree diagram obtained from average linkage clustering of the 39 clusters.

FIG. 17. An example of a bowtie. The function d(x,y) represents the Euclidean distance between point x and point y. (H=Head, T=Tail).

FIG. 18. Algorithm for finding matching frequency of motifs.

FIG. 19. Algorithm for finding peak motifs.

FIG. 20. One-pass algorithm for clustering motifs.

FIG. 21. The span of the one-pass algorithm.

FIG. 22. The greedy algorithm for clustering motifs.

FIG. 23. The span of the greedy algorithm for clustering motifs.

FIG. 24. The combined one-pass and greedy algorithm for clustering motifs.

FIG. 25. The greedy algorithm with sealevel for clustering motifs.

FIG. 26. The span of the greedy algorithm with sealevel.

FIG. 27. Adaptive sealevel applied in combination with the greedy algorithm.

FIG. 28. The one-pass algorithm with RMSD tolerance.

FIG. 29. Number of motifs versus size of motifs.

FIG. 30. Highest matching frequency for each family tolerance.

FIG. 31. Representative 4-motif C29.

FIG. 32. Representative 5-motif C30.

FIG. 33. Representative 6-motif C1.

FIG. 34. Representative 7-motif C10.

FIG. 35. Pseudo-code for finding the matching frequency of surface patches.

FIG. 36. Pseudo-code for initial patch creation.

FIG. 37. Pseudo-code for the creation of higher order N-patches.

FIG. 38. The complete graph K₉ for a 9-patch.

FIG. 39. a) Some scaffolds that match the common β-turn motifs, b) Some scaffolds that match the common loop motifs, and c) A scaffold that matches a common six-residue protein-protein interaction surface. The common motifs of the queries are shown in thicker lines in sections a, b and c.

Table 1. RMSD matrix of the results obtained from filtered centroid sorting refinement of the 7 clusters formed from the fourth cycle of the k^(th)-nearest neighbor clustering algorithm.

Table 2. RMSD matrix of the eight clusters formed from the clustering algorithm and explicit division of cluster three into two clusters.

Table 3. RMSD matrix of the nine clusters formed from the clustering algorithm, explicit division of cluster three into two clusters and explicit inclusion of the mean of type I′ in the initial seeds.

Table 4. Comparison of the unique peaks obtained using a NEIGHBOR_LIMIT of 0.3 with the unique peaks obtained using higher NEIGHBOR_LIMIT values of 0.4, 0.5, 0.6 and 0.7. ¹The unique-peak number obtained using a NEIGHBOR_LIMIT of 0.3 that is most similar to the unique peak obtained using the higher NEIGHBOR_LIMIT. ²The unique-peak number obtained using the higher NEIGHBOR_LIMIT that is shown in the header. ³The RMSD value from the superimposition of the unique peaks obtained using the NEIGHBOR_LIMIT of 0.3 and those obtained using the corresponding higher NEIGHBOR_LIMIT.

Table 5. RMSD matrix of the 39 clusters. All the loops of cluster x (row x) are superimposed onto the peak structures of cluster y (column y) and the resulting average RMSD obtained is shown in row x and column y. Since the average RMSD for row x, column y is very similar to that of row y, column x, the two values are averaged and placed in row x and column y where x is greater than or equal to y. The intracluster RMSD values are highlighted by background shading. Average intercluster RMSD values that are within 0.2 of the highest intracluster RMSD of 0.56 are also highlighted by background shading.

Table 6. Summary of results for 4-residue motifs.

Table 7. RMSD values for clustering of 4-residue motifs for initial families TOL 0.75 and inter TOL 0.5 with sealevel 0.25 times the peak.

Table 8. Summary of results for 5-residue motifs.

Table 9. RMSD values for clustering of 5-residue motifs for initial families TOL 0.75 and inter TOL 0.7 with sealevel 0.125 times the peak.

Table 10. Summary of results for 6-residue motifs.

Table 11. RMSD values for clustering of 6-residue motifs for initial families TOL 0.75 and inter TOL 0.7 with sealevel 0.125 times the peak.

Table 12. Summary of results for 7-residue motifs.

Table 13. RMSD values for clustering of 7-residue motifs for initial families TOL 0.75 and inter TOL 0.9 with sealevel 0.125 times the peak.

Table 14. The secondary structure classifications as made by DSSP.

Table 15. The secondary structure of the original clusters for 4-residue motifs.

Table 16. The secondary structure of the original clusters for 5-residue motifs.

Table 17. The secondary structure of the original clusters for 6-residue motifs.

Table 18. The secondary structure of the original clusters for 7-residue motifs.

Table 19. The proportion of motifs not spanned by a single α-helix.

Table 20. The classification of the best 30 seeds of secondary structure not spanned by a single α-helix.

Table 21. Summary of results for the non-single α-helix seeds.

Table 22. RMSD values for clustering of 4-residue motifs for non-single-α-helix seeds with greedy tolerance 0.5 Å and sealevel 0.25 times the peak matching frequency.

Table 23. The secondary structure of the non-single-α-helix clusters for 4-residue motifs.

Table 24. The proportion of motifs in the non-single-α-helix clusters that do not contain a single α-helix.

Table 25. The secondary structure of the non-single-α-helix clusters for 7-residue motifs.

Table 26. The (x, y, z) coordinates of the mean structure of the 5^(th)-least-common cluster of the β-turns, loops and surface motifs.

DETAILED DESCRIPTION OF THE INVENTION

The present invention describes the clustering of charged protein surface patches or regions, and the clustering of side chains of continuous and discontinuous surfaces of proteins into distinct motifs. These motifs are used as “descriptors” to design molecules that mimic specific, common protein shapes. This is achieved by using these common protein motifs as a screen in a virtual screening of a virtual library and in de novo molecular design and engineering. This approach results in the discovery of molecules that mimic common protein shapes. The synthesis of individual molecules or libraries of molecules that mimic protein shapes will result in the discovery of biologically active molecules that are capable of modulating protein function. In this respect these motifs or descriptors are used to select molecules from the vast chemical universe that match common protein shapes.

In order to define the common structural features of proteins used in molecular recognition, the present inventors have derived a new classification system to describe protein structure, based on the location and orientation of side chains, and based on the shape of the charged protein surface patches or regions, such as in protein-protein interaction regions. In particular the present inventors have focussed on defining the side chain arrangements of β-turns and loops, as these are primarily responsible for molecular recognition, as well as the side chain arrangements of contact surfaces.

Therefore, the present invention provides the identification of common protein motifs and the use of these as descriptors in molecular design. Libraries of molecules that mimic protein shapes should be valuable for the discovery of biologically active molecules using high throughput screening methodologies. These libraries would form the foundation for the discovery of novel drugs.

As used herein, by “protein” is meant an amino acid polymer. Amino acids may be D- or L-amino acids, natural and non-natural amino acids as are well understood in the art. Chemically modified and derivatized amino acids are also contemplated according to the invention, as are well understood in the art.

A “peptide” is a protein having less than fifty (50) amino acids.

A “polypeptide” is a protein having fifty (50) or more amino acids.

As used herein, a “protein surface shape” is any three-dimensional property or feature of a protein surface, such as may be described according to amino acid side chain location and orientation or by surface charge distribution.

In one particular embodiment, the property or feature is a three-dimensional side-chain location and orientation of each of two or more amino acids of a protein.

In another particular embodiment, the property or feature is a three-dimensional charge distribution of one or more surface regions of a protein.

Suitably, the protein surface shape is of, comprises or is derived from, a structural feature of a protein. Such a structural feature may, for example, be a contact surface that interacts with another protein or other molecule such as a nucleic acid, nucleotide or nucleoside (e.g. ATP or GTP), carbohydrate, glycoprotein, lipid, glycolipid or small organic molecule (e.g. a drug or toxin) without limitation thereto. Therefore, for the purposes of exemplification, a domain may be a ligand-binding domain of a receptor, a DNA-binding domain of a transcription factor, an ATP-binding domain of a protein kinase, chaperonin or other protein folding and/or translocation enzyme, a receptor dimerization domain or other protein interaction domains such as SH2, SH3 and PDZ domains, although the skilled person will appreciate that the present invention is not limited to these particular examples.

Structural features of proteins may include loops, β-turns or other contact surfaces, helical regions, extended regions and other protein domains.

Preferred structural features are in the form of loops, β-turns or other contact surfaces.

More preferred structural features are loops and contact surfaces.

As used herein, “contact surfaces” are protein surfaces having amino acid residues that contact or interact with another molecule, such as another protein. An example of a contact surface is the ligand-binding surface of a cytokine receptor, although without limitation thereto.

Contact surfaces may be composed of one or more discontinuous and/or continuous surfaces.

By “discontinuous protein surface” is meant a protein surface wherein amino acid residues are non-contiguous or exist in discontinuous groups of contiguous amino acid residues.

In this regard, it will be appreciated that β-turns and loops are examples of a “continuous protein surface”, that is, a protein surface that comprises a contiguous sequence of amino acids.

According to the invention, it is preferred that the location and orientation in 3D space of each amino acid side-chain is simplified as a Cα-Cβ vector.

In one embodiment, a “descriptor” is a representation of common, or at least topographically related, amino acid side-chain locations and orientations in 3D space derived from two or more amino acids of each of two or more proteins. Typically, a descriptor corresponds to a cluster of Cα-Cβ vectors obtained from two or more β-turns, loops, protein contact surfaces, helices or other structural features. Clusters are essentially groupings of β-turns, loops or protein contact surfaces with common 3D topography. Clusters may be created by any algorithm that compares similarity and/or dissimilarity between constituent Cα-Cβ vectors of β-turns, loops or protein contact surfaces. Examples of clustering algorithms are provided in detail hereinafter.

In another embodiment, a “descriptor” is a representation of one or more common, three-dimensional distributions of charge across one or more surface regions of two or more proteins.

According to this embodiment, it is preferred that said descriptor represents four or more grid points.

Preferably, respective grid points are 0.2 to 2.0 angstrom apart in three-dimensional (3D) space.

In particular embodiments, respective grid points may be 0.2, 0.5, 1.0, 1.2, 1.5 or 2.0 angstrom apart in three-dimensional (3D) space.

It will also be appreciated that grid point dimensions may be modified within the ranges recited above as desired. For example, protein surface regions that contribute significantly to protein-protein interaction may be represented by relatively tighter, less spaced-apart grid points. Conversely, protein surface regions that have less contribution to protein-protein interaction may be represented by relatively looser, more spaced-apart grid points.

It will further be appreciated that in particular embodiments, descriptors of common surface shape may be in the form of “average” surface shape, inclusive of “mean”, “median” and “mode” surface shape.

Preferably, the common surface shape is a “mean” surface shape.

As already described hereinbefore, a side-chain location and orientation in 3D space, preferably simplified as a Cα-Cβ vector, is required of at least two amino acids of each said β-turn, loop or contact surface.

In one embodiment, side-chain location and orientation of four β-turn or loop amino acids is required.

In another embodiment, side-chain location and orientation of at least three amino acids is required for a contact surface.

In particular embodiments, four, five, six or seven amino acid side-chain locations and orientations are required for a contact surface.

In some cases, descriptors are produced from protein structural information extracted from a source database such as the Protein Data Bank. In such cases, it is preferred that only non-homologous protein chains with relatively low homology (e.g. no greater than 25%) to other proteins are used. This reduces biased sampling caused by the presence in the source database of multiple structures that are minor variants of each other.

In other cases, descriptors may be produced from protein structural information produced de novo, or from X-ray crystallographic or NMR determinations of protein 3D structure.

In light of the foregoing, it will be appreciated that the invention provides classification of proteins according to common amino acid side-chain locations and orientations to thereby produce a representation of common spatial elements of protein surfaces.

As will be described in more detail hereinafter, the present inventors to date have identified at least 9 β-turn, 39 loop and 240 protein contact surface shapes that occur with high frequency and may be useful in molecular database screening and library design.

For the purposes of database searching, a number of options are available for suitable representation of Cα-Cβ vectors, whether as a database entry or as a query:

-   (A) as a distance matrix;
-   (B) as a dihedral angle (δ) formed between respective Cα-Cβ vectors;
-   (C) as angles α₁ and α₂ formed between respective Cα-Cβ vectors.

Explanations of these representations are provided in Lauri & Bartlett⁵⁶ and International Publication WO 00/23474.

A preferred representation of Cα-Cβ vectors is as a distance matrix.

It will also be appreciated that a preferred representation of surface charge grid points is as a distance matrix.
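By way of illustration only, the distance-matrix representation may be sketched as follows. The sketch treats each Cα-Cβ vector as two points and records every pairwise distance; the same function applies unchanged to surface charge grid points. The function names, the coordinates and the tolerance-based comparison are assumptions made for this example and are not taken from the VECTRIX program or from the cited references. The comparison also assumes that the points of the query and of the database entry are already listed in corresponding order.

```python
import math
from itertools import combinations


def vectors_to_points(vectors):
    """Flatten (c_alpha, c_beta) coordinate pairs into one ordered point list."""
    points = []
    for c_alpha, c_beta in vectors:
        points.append(tuple(c_alpha))
        points.append(tuple(c_beta))
    return points


def distance_matrix(points):
    """Return the symmetric matrix of pairwise Euclidean distances (in angstrom)."""
    n = len(points)
    matrix = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        matrix[i][j] = matrix[j][i] = math.dist(points[i], points[j])
    return matrix


def matrices_match(a, b, tolerance):
    """True if the matrices have equal size and every entry agrees within tolerance."""
    if len(a) != len(b):
        return False
    return all(
        abs(x - y) <= tolerance
        for row_a, row_b in zip(a, b)
        for x, y in zip(row_a, row_b)
    )


# Two hypothetical C-alpha/C-beta vectors (coordinates invented for illustration).
query = [((0.0, 0.0, 0.0), (0.0, 0.0, 1.5)),
         ((3.8, 0.0, 0.0), (3.8, 1.5, 0.0))]

# The same vectors after a rigid translation of 10 angstrom along each axis.
shifted = [(tuple(x + 10.0 for x in ca), tuple(x + 10.0 for x in cb))
           for ca, cb in query]

query_matrix = distance_matrix(vectors_to_points(query))
shifted_matrix = distance_matrix(vectors_to_points(shifted))
print(matrices_match(query_matrix, shifted_matrix, tolerance=1e-9))
```

Because pairwise distances are unchanged by rotation and translation, a query and a database entry can be compared in this form without first superimposing them. Distances are also unchanged by mirror reflection, so a match on distances alone does not distinguish a motif from its mirror image.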

A computer-searchable database may be an existing database such as the Protein Data Bank, Cambridge Structural Database, Brookhaven Database or may be a database constructed de novo. For the purpose of database searching, entries may be in the form of representations of proteins, peptides or other organic molecules. It is preferred that entries in searchable protein databases are in the form of charged surface or Cα-Cβ simplifications of constituent amino acid side chains. In cases where the searchable database comprises non-protein organic molecules (such as the Cambridge Structural Database), entries may typically be represented according to charge surface or 3D coordinate of particular atoms or groups of atoms (such as particular carbon, nitrogen and oxygen atoms, for example).

Suitably, a computer program is used for database searching.

Preferably, said computer program is the VECTRIX program, as described in International Publication WO 00/23474.

It will therefore be appreciated that the database searching method of the invention is capable of identifying one or more molecules, or portions of said molecules, that mimic common protein surface shapes. These molecules may then be used to construct virtual or synthetic chemical libraries that have been focussed by selecting molecules that are more likely to mimic the common protein surface shapes.

So that the invention may be more properly understood and put into practical effect, the skilled person is directed to the following non-limiting examples.

EXAMPLES

1 The Clustering of β-Turns

1.1 Background

Protein structure comprises stretches of secondary structure (helices or β-sheets) that are joined by turns, which enable a reversal in chain direction. These turns are normally positioned on the surfaces of proteins and allow the formation of the globular protein interior¹. β-turns²⁻⁴ are more common than the tighter coiled γ-turns and the looser coiled α-turns and have been defined as four-residue segments of polypeptides in which the distance between c_(αi) and c_(αi+3) is less than 7 Å and in which the central residues are not helical⁵. β-turns encompass 25% of residues in proteins⁶, are important for protein and peptide function^(2,7-9), and are an important driving force in protein folding^(2,10,11). Consequently, there have been numerous studies on the design and development of β-turn mimetics^(7,12-22).

Despite the importance of side chain spatial arrangement in molecular recognition, the conformations of β-turns are currently classified in terms of the main chain dihedral angles, φ and ψ^(3,5,23-25). Although this classification of β-turns has been extremely useful and has been used widely to design peptidomimetics, it makes very little functional sense. Each type of β-turn in the current classification can have two or more clusters of side chain spatial arrangement, and different types of β-turns can have the same side chain spatial arrangement. There have been two reports on the classification of β-turns based on the arrangements of the side chains.²⁶ Whilst the β descriptor of Ball et al²⁷ defines some global structural characteristics of the turns, it is clearly an oversimplification, as it only considers two of the possible four side chain positions and used only a small data set of 154 experimentally derived β-turns. Garland and Dean^(28,29) have found common motifs for four subsets of β-turns (c_(α) atom doublets, c_(α) atom triplets, c_(α)-c_(β) vector doublets and c_(α)-c_(β) vector triplets) by hierarchically clustering the conformations of side chains. However, the clustering was not based on experimental data, but was based on all possible permutations of selecting doublets or triplets out of each of the eleven idealized β-turn types. Furthermore, it is not necessary to explicitly identify subsets of the β-turns for mimicry as was done by Garland and Dean^(28,29). This is because it is possible to derive these subsets from the whole β-turn using more sophisticated searching algorithms³⁰ than those considered by Garland and Dean.

1.2 Method

1.2.1 β-Turns Clustering

1.2.1.1 Extraction of β-Turns from Protein Data Bank (PDB)

A high-resolution and non-redundant database of β-turns is required for the determination of common β-turn motifs that exist in proteins. To ensure high quality data, only high-resolution structures with a resolution of ≦2 Å and an R factor ≦20% were extracted from the 1997 release of the Protein Data Bank³¹. Furthermore, to eliminate the biased sampling in the PDB caused by the presence of multiple structures that are minor variations of a particular protein chain, only non-homologous protein chains with ≦25% homology with other protein chains were used. The distribution of the c_(α1)-c_(α4) distances of the resulting 3984 four-residue segments that are neither helical nor β-sheet is plotted in FIG. 1, and a major peak is observed at a c_(α1)-c_(α4) distance of 5.5 Å. In 1973, Lewis²⁴ concluded that β-turns have a c_(α1)-c_(α4) distance of ≦7 Å based on the distribution of the c_(α1)-c_(α4) distances of only eight structures determined by X-ray diffraction. To remove any possible biases caused by noisy data, the outliers of the major peak at 5.5 Å were removed by eliminating the turns with a c_(α1)-c_(α4) distance of less than 5 Å or greater than 6.2 Å, resulting in 2675 β-turns in the database.
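The final distance filter can be illustrated with a minimal Python sketch. The function name `select_turns` and the input layout (each segment given as a list of four c_(α) coordinates) are assumptions made for illustration only; the resolution, R factor, homology and secondary structure filters described above are not reproduced here.

```python
import math

def select_turns(segments, low=5.0, high=6.2):
    """Keep four-residue segments whose Ca1-Ca4 distance lies in [low, high] Angstrom.

    `segments` is an iterable of segments, each a sequence of four (x, y, z)
    Ca coordinates (hypothetical layout chosen for this sketch).
    """
    kept = []
    for seg in segments:
        d14 = math.dist(seg[0], seg[3])  # Ca1-Ca4 distance
        if low <= d14 <= high:           # discard < 5 A or > 6.2 A
            kept.append(seg)
    return kept
```

The default bounds are the 5 Å and 6.2 Å cut-offs quoted in the text; both are exposed as parameters so that a different window around the 5.5 Å peak could be tried.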

1.2.1.2 Representation of Data

The present inventors' motivations for clustering using c_(α)-c_(β) vectors are several fold. The c_(α)-c_(β) vector describes the initiation of the side chain geometry, and is well defined experimentally as it is anchored to the backbone. This is in contrast to the more flexible penultimate side chain atoms. Importantly, most mimetic strategies involve anchoring c_(α)-c_(β) bonds to a non-peptidic scaffold, the extra atoms of the side chain providing a degree of flexibility in molecular recognition. We therefore consider that clustering according to c_(α)-c_(β) vectors is functionally significant when the aim is to use the motifs identified to design molecules that match specific motifs.

Each of the 20 naturally-occurring amino acids, except for glycine, possesses a c_(α)-c_(β) vector due to the covalent bond between the central α carbon and the β carbon of the side chain. For β-turns that contain a glycine, the glycine residue was mutated to alanine to generate the required c_(α)-c_(β) vector. This was achieved by superimposing an ideal alanine structure onto the n, c_(α) and c′ atoms of the glycine residue.

An important advance in database searching has been made by representing 3D structures in terms of the relationship between atoms located in distance space, rather than Cartesian space^(30,32). A location in distance space is defined by distances between atoms, expressed in the form of a distance matrix. Distance matrices are therefore coordinate independent, and comparisons between distance matrices can be made without restriction to a particular frame of reference, such as is required using Cartesian coordinates. It is important to emphasize that an arrangement of atoms and its mirror image are described by identical distance matrices. A root mean squared deviation (RMSD), computed after superimposition in Cartesian space, can be used to resolve this ambiguity. The four c_(α)-c_(β) vectors of each β-turn are represented by a distance matrix rather than a Cartesian coordinate system. Since there are four distances between each pair of c_(α)-c_(β) vectors (c_(α1)-c_(α2), c_(α1)-c_(β2), c_(β1)-c_(α2) and c_(β1)-c_(β2)) and there are six possible pairs of c_(α)-c_(β) vectors (1-2, 1-3, 1-4, 2-3, 2-4 and 3-4), 24 distances are required to represent the 3D topography of a β-turn. The distances between c_(αi) and c_(βi) were not included because these bonded distances are relatively invariant between β-turns when compared to the non-bonded distances used.
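As an illustration of the 24-distance representation, the following Python sketch builds the descriptor from four c_(α)-c_(β) vectors. The function name `distance_descriptor` and the input layout (a list of four `(ca, cb)` coordinate pairs) are assumptions made for this sketch and do not come from the original program.

```python
import math
from itertools import combinations

def distance_descriptor(vectors):
    """Return the 24 non-bonded distances between four Ca-Cb vectors.

    `vectors` is a list of four (ca, cb) pairs, each an (x, y, z) tuple
    (hypothetical layout). The bonded Ca_i-Cb_i distances are omitted.
    """
    if len(vectors) != 4:
        raise ValueError("expected four Ca-Cb vectors")
    descriptor = []
    # Six vector pairs in the order 1-2, 1-3, 1-4, 2-3, 2-4, 3-4.
    for i, j in combinations(range(4), 2):
        ca_i, cb_i = vectors[i]
        ca_j, cb_j = vectors[j]
        descriptor.extend([
            math.dist(ca_i, ca_j),  # Ca_i-Ca_j
            math.dist(ca_i, cb_j),  # Ca_i-Cb_j
            math.dist(cb_i, ca_j),  # Cb_i-Ca_j
            math.dist(cb_i, cb_j),  # Cb_i-Cb_j
        ])
    return descriptor
```

Because only distances are stored, a turn and its mirror image give the same descriptor, which is the ambiguity noted above.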

1.2.1.3 k^(th)-Nearest Neighbor

The k^(th)-nearest neighbor clustering algorithm^(33,34) employed here for clustering of β-turns is essentially a single-linkage clustering algorithm³⁵ in which every member is initially assigned to a different cluster and clusters are subsequently merged if the minimum distance between a member of one cluster and a member of another cluster is less than some threshold. The k^(th)-nearest neighbor clustering algorithm^(33,34) differs from the single-linkage clustering algorithm in that the distance between members is replaced by a dissimilarity measure defined below.

d_(k)(x) is defined as the Euclidean distance from observation x to the k^(th) nearest observation. v_(k)(x) is defined as the volume enclosed by the sphere centered at observation x and having a radius of d_(k)(x). The density at observation x, f(x), is defined as k/(N·v_(k)(x)), where N is the total number of observations. The dissimilarity measure between observations x_(i) and x_(j), D(x_(i), x_(j)), can be calculated from the following definitions. First, x_(i) and x_(j) are said to be adjacent if the Euclidean distance between the two points is less than d_(k)(x_(i)) or d_(k)(x_(j)). If observations x_(i) and x_(j) are not adjacent, then the dissimilarity measure, D(x_(i), x_(j)), is set to infinity. Otherwise, D(x_(i), x_(j)) is defined as the average of the inverses of the densities, i.e. D(x_(i), x_(j)) = (1/f(x_(i)) + 1/f(x_(j)))/2. Clustering should group together regions of high density separated by regions of low density. Effectively, by defining the dissimilarity measure as the inverse of the density, this algorithm first groups together adjacent points or clusters that have high density.
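The definitions above can be sketched in Python as follows. This is an illustrative reading of the text, not the SAS/STAT implementation: the function names are hypothetical, the sphere volume is taken to be that of a ball in the dimension of the observations, and adjacency uses the strict "less than" wording of the paragraph above. The inverse density 1/f(x) is computed directly so that coincident points do not cause a division by zero.

```python
import math

def unit_ball_volume(dim):
    """Volume of the unit ball in `dim` dimensions."""
    return math.pi ** (dim / 2) / math.gamma(dim / 2 + 1)

def kth_neighbor_distances(points, k):
    """d_k(x): distance from each point to its k-th nearest other point."""
    result = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        result.append(dists[k - 1])
    return result

def dissimilarity_matrix(points, k):
    """D(x_i, x_j) = (1/f(x_i) + 1/f(x_j)) / 2 if adjacent, else infinity."""
    n, dim = len(points), len(points[0])
    d_k = kth_neighbor_distances(points, k)
    # 1/f(x) = N * v_k(x) / k, with v_k(x) the volume of a ball of radius d_k(x).
    inv_f = [n * unit_ball_volume(dim) * r ** dim / k for r in d_k]
    D = [[math.inf] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # Adjacent if closer than d_k(x_i) or d_k(x_j).
            if math.dist(points[i], points[j]) < max(d_k[i], d_k[j]):
                D[i][j] = D[j][i] = 0.5 * (inv_f[i] + inv_f[j])
    return D
```

Single-linkage merging on this matrix then joins dense, adjacent regions first, since their dissimilarities are the smallest finite entries.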

The k^(th)-nearest neighbor algorithm from the commercially available SAS/STAT program³⁶ was used to cluster the distance matrices representing the topography of the c_(α)-c_(β) vectors of the β-turns. The option ‘k’ is called the smoothing parameter. A small value of ‘k’ produces jagged density estimates and large numbers of clusters, and a large value of ‘k’ produces smooth density estimates and fewer clusters. A ‘k’ value of two was used because only a rough estimate of clusters is required here. The clusters obtained here are used as initial seeds for the filtered nearest centroid sorting algorithm described below.

1.2.1.4 Filtered Nearest Centroid Sorting Clustering Algorithm

The nearest centroid sorting clustering algorithm by Forgy^(37,38) requires a prior estimate of some initial seeds. The algorithm assigns each observation to the nearest initial seed to form temporary clusters. The seeds are then replaced by the means of the temporary clusters and the process is repeated until no further changes occur in the clusters. After the k^(th) nearest neighbor clustering of the β-turns, a modified form of the nearest centroid sorting algorithm³⁷, the filtered nearest centroid sorting clustering algorithm, was used to refine the clustering. This method superimposes observations in Cartesian coordinate space and thus removes the mirror image problem inherited from the distance matrix representation used in the k^(th) nearest neighbor clustering algorithm. The reasons for this two-stage clustering process are: 1) hierarchical clustering based on RMSD could not be used in the first place because the number of observations is larger than the limit set by the SAS/STAT program³⁶; and 2) faster and leaner approximation methods, such as the nearest centroid sorting^(37,38) or K-means clustering algorithm³⁹, could not be used without a prior estimate of initial seeds.

The filtered nearest centroid sorting algorithm is basically the same as the nearest centroid sorting algorithm except that if the minimum RMSD of a β-turn to the mean structures is above some definable threshold, then the turn is considered too remote and is therefore not assigned to the temporary clusters. In later iterations, these unassigned turns are superimposed onto the new mean structures of the new temporary clusters and, if the minimum RMSD is below the threshold, they are assigned to the new temporary cluster. The aim of this filtering is to remove the turns that are very different from the mean structures, so as not to bias the mean. Furthermore, not all of the β-turns need to be clustered; only a major proportion is required. The filtered nearest centroid sorting algorithm was implemented in a C++ program entitled “fncsa_cluster-analysis.cpp”.

1.2.2 Cluster Analysis

1.2.2.1 Vector Plots of β-Turns

It is difficult to visualise the 24 distances that represent the topography of a β-turn. A vector plot is used to aid the visualization by approximating the 24 distances with four torsional angles θ1, θ2, θ3 and θ4 (see FIG. 2). θ1 is defined as the torsional angle between c_(β1), c_(α1), c_(α2) and c_(β2); θ2 is defined as the torsional angle between c_(β2), c_(α2), c_(α3) and c_(β3); θ3 is defined as the torsional angle between c_(β3), c_(α3), c_(α4) and c_(β4); and θ4 is defined as the torsional angle between c_(β1), c_(α1), c_(α4) and c_(β4). Since the distances between adjacent c_(α) atoms in a β-turn are relatively constant due to the nature of the peptide bond, the four torsional angles should represent the essential conformational features of a β-turn. The four torsional angles are plotted on a 2D graph as a vector from (θ1, θ2) (represented by the symbol ‘x’) to (θ3, θ4). The torsional angles are periodic: a value of x is equivalent to x−360, x+360 and so on. To remove the graphing problem associated with the periodic nature of the torsional angles, each torsional value is transformed into the period that is closest to the torsional angles of the first β-turn.
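A minimal Python sketch of the four torsional angles and of the period adjustment is given below. The helper names (`torsion`, `turn_torsions`, `nearest_period`) and the `(ca, cb)` input layout are assumptions for illustration; the torsion is computed with the standard atan2 formula and returned in degrees in the range −180 to 180.

```python
import math

def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def torsion(p1, p2, p3, p4):
    """Torsional angle p1-p2-p3-p4 in degrees."""
    b1, b2, b3 = _sub(p2, p1), _sub(p3, p2), _sub(p4, p3)
    n1, n2 = _cross(b1, b2), _cross(b2, b3)
    y = math.sqrt(_dot(b2, b2)) * _dot(b1, n2)
    x = _dot(n1, n2)
    return math.degrees(math.atan2(y, x))

def turn_torsions(vectors):
    """(theta1, theta2, theta3, theta4) for four (ca, cb) pairs."""
    (ca1, cb1), (ca2, cb2), (ca3, cb3), (ca4, cb4) = vectors
    return (torsion(cb1, ca1, ca2, cb2),
            torsion(cb2, ca2, ca3, cb3),
            torsion(cb3, ca3, ca4, cb4),
            torsion(cb1, ca1, ca4, cb4))

def nearest_period(angle, reference):
    """Shift `angle` by multiples of 360 so it lies closest to `reference`."""
    return angle - 360.0 * round((angle - reference) / 360.0)
```

For plotting, each angle of a turn would be passed through `nearest_period` with the corresponding angle of the first β-turn as the reference, so that, for example, −170° is drawn as 190° when the reference is 175°.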

1.2.2.2 Visualizing β-Turn Clusters

Another method to visualize the clusters of β-turns is to superimpose the 3D structures of all the turns in a cluster. Superimposition is performed from the four c_(α)-c_(β) vectors of a β-turn to the four c_(α)-c_(β) vectors of the mean structure of the cluster. For glycine, the c_(α)-c_(β) vector is obtained by superimposing a standard alanine residue onto the n, c_(α) and c′ atoms of the backbone of the glycine residue. The “fncsa_cluster_analysis.cpp” program outputs the coordinates of the superimposed structures in a multi-structure PDB file format, which is visualised using the program InsightII of Molecular Simulation Inc.

There are a few steps in the calculation of the mean of a cluster in Cartesian coordinate space. Firstly, an initial mean structure for a cluster is set to be the first β-turn that does not have a glycine or proline residue. Then each β-turn is superimposed onto this temporary mean structure based on the coordinates of the c_(α)-c_(β) vectors. After the superimposition, a new temporary mean structure is computed by averaging the x, y and z coordinates. The latter two steps are repeated until successive mean structures differ by less than some arbitrary threshold.

1.2.2.3 Calculation of the RMSD Matrix of all the Clusters

An RMSD matrix is calculated to examine the performance of the clustering by assessing the dissimilarity within and between clusters. Each cluster is compared with every other cluster so that the row and column numbers of the matrix represent the cluster numbers. The value in a cell at row x and column y represents the mean RMSD when all the β-turns in cluster x are superimposed onto the mean structure of cluster y. The diagonal of the matrix (row x and column x) represents the intra-cluster RMSD, while the other cells represent the inter-cluster RMSD. Values in row x and column y are not necessarily identical to values in row y and column x because the former represent the mean RMSD of the β-turns in cluster x superimposed onto the mean structure of cluster y and the latter represent the mean RMSD of the β-turns in cluster y superimposed onto the mean structure of cluster x. In practice, however, the two numbers are very similar.

1.3 Results

1.3.1 Clustering

The k^(th) nearest neighbor clustering algorithm was used to cluster the 2675 β-turns in the database. The mean structure (seed) of each of the 570 clusters obtained was calculated by averaging each of the 24 distances representing the topography of β-turns. In the second cycle, k^(th) nearest neighbor clustering was performed on these 570 seeds and 117 seeds were obtained. The third cycle of k^(th) nearest neighbor clustering produced 25 seeds and the fourth cycle produced 7 seeds. Both the 7 and 25 seeds were examined in more detail prior to the selection of the final β-turn clusters.

To determine a reasonable value for the threshold used in the filtered nearest centroid sorting algorithm, the seven seeds obtained from the k^(th) nearest neighbor clustering were refined using the filtered nearest centroid sorting algorithm with four different threshold values: an RMSD of 0.6, 0.65, 0.7 and infinity (no threshold at all). The results show that the lower the threshold, the higher the percentage of β-turns that are rejected (not assigned to a cluster). The percentages of rejection for the four threshold values are 19%, 14%, 8% and 0% respectively. The RMSD matrices of the results were calculated and the averages of the mean inter-cluster RMSD are 1.05, 0.95, 1.03 and 1.15 respectively. Ideally one would like clusters to differ as much as possible, and hence to have a high inter-cluster RMSD. The mean inter-cluster RMSD decreased in going from a threshold of infinity to 0.7 and then to 0.65; however, it increased in going from 0.65 to 0.6. The averages of the mean intra-cluster RMSD are 0.36, 0.36, 0.40 and 0.44 respectively. Here a low intra-cluster RMSD is favored, since it indicates that the observations in each cluster are similar. There were improvements in the intra-cluster RMSD in going from a threshold of infinity to 0.7 and from 0.7 to 0.65. However, there was no improvement in going from a threshold of 0.65 to 0.6. As a compromise between the conflicting interests of percentage rejection, inter-cluster RMSD and intra-cluster RMSD, a filter threshold of 0.65 was chosen.

To determine whether the 25 seeds from the third cycle or the 7 seeds from the fourth cycle of the k^(th) nearest neighbor clustering best represent the side chain spatial arrangements of β-turns, both results were subjected to the filtered nearest centroid sorting algorithm followed by the calculation of the RMSD matrix. The RMSD matrix for the 7 clusters is shown in Table 1. The clustering process aims to define clusters that have a low intra-cluster RMSD and are separated by a high inter-cluster RMSD. For the 25 clusters, the average of the mean intra-cluster RMSD is 0.31, the average of the mean inter-cluster RMSD is 1.11 and the maximum mean intra-cluster RMSD is 0.42. For the 7 clusters (Table 1), the average of the mean intra-cluster RMSD is 0.36, the average of the mean inter-cluster RMSD is 0.95 and the maximum mean intra-cluster RMSD is slightly higher, 0.49. The results show that the clustering into the 7 clusters is not as good as the clustering into the 25 clusters: the intra-cluster RMSD was larger (0.36 compared to 0.31) and the inter-cluster RMSD was smaller (0.95 compared to 1.11). However, since this is not a drastic difference and the 7 clusters still give a reasonable intra-cluster RMSD, the more tractable 7-cluster result was preferred over the 25-cluster result.

1.3.2 Refinement of the Clustering

Vector graphs, as described in the method section, were used to visualize the seven-cluster result (FIG. 3). The figure shows that all the clusters except for cluster three have a reasonably uniform distribution about a single mode. Cluster three, however, seems to have two modes, one with θ4≧70° and the other with θ4<70°. Furthermore, the RMSD matrix in Table 1 shows that cluster three has the highest intra-cluster RMSD, 0.49. To determine whether cluster three should remain as one cluster or should be divided into two clusters, a practical step was used in which cluster three was divided into two clusters (one cluster with θ4≧70° and another cluster with θ4<70°) and the new result was assessed by comparison with the original result. The resulting eight clusters were refined once more using the filtered nearest centroid sorting algorithm. The RMSD matrix and the vector plot for the new eight clusters were calculated and the results are shown in Table 2 and FIG. 4 respectively. In dividing cluster three into two clusters, the mean intra-cluster RMSD did not change significantly (from 0.36 to 0.35), the maximum intra-cluster RMSD improved from 0.49 to 0.43, the minimum intra-cluster RMSD improved from 0.31 to 0.29, the mean inter-cluster RMSD changed from 0.95 to 0.93 and the percentage of turns represented by the clusters remained the same at 86%. The vector plot in FIG. 4 shows that the β-turns in each cluster are distributed within a narrow range about a single mode. These results suggested that the eight-cluster system is a better representation of β-turn motifs than the seven-cluster system.

It was observed that type I′ β-turns were not included in any of the eight clusters; they were rejected in the filtered nearest centroid sorting clustering because their RMSD to the means of the eight clusters was more than the threshold of 0.65. This reflects a weakness in the k^(th) nearest neighbor algorithm, as it does not identify a seed near the low frequency type I′ β-turns. Since FIG. 9 indicates that there is a cluster near the type I′ β-turns, the mean of the type I′ turns was calculated and the result was included together with the other eight initial seeds for the filtered nearest centroid sorting clustering. The RMSD matrix and the vector plot for the new nine clusters were calculated and the results are shown in Table 3 and FIG. 5 respectively. With the addition of the type I′ average structure to the initial seeds, the mean intra-cluster RMSD did not change significantly (from 0.35 to 0.36), the minimum and maximum intra-cluster RMSD remained the same, the mean inter-cluster RMSD worsened (from 1.25 to 1.1) and the percentage of β-turns classified improved from 86% to 90%. The vector plots in FIG. 5 show that the β-turns in each cluster are distributed within a narrow range about a single mode. These results suggest that the nine-cluster system is a reasonable representation of β-turn motifs.

1.3.3 Mean Structures

The final nine-cluster result was also visualized by superimposing each β-turn in the clusters onto the cluster's mean structure (FIG. 6). The visual result is consistent with the mean intra-cluster RMSD value of each cluster in Table 3. The cluster with the least amount of c_(α)-c_(β) vector spread (cluster 2, FIG. 6) corresponds to the smallest mean intra-cluster RMSD. It is interesting to note that the backbone structure can vary significantly although the c_(α)-c_(β) vectors are uniform within a cluster. A top view of the most uniform cluster, cluster 2 for example, shows that different backbone conformations can have a similar c_(α)-c_(β) vector spatial arrangement (FIG. 7). In this instance, type I and type II β-turns present the same c_(α)-c_(β) vector spatial arrangement. To appreciate the difference between the clusters, the mean structures of each cluster were superimposed based on the c_(α1), c_(α2) and c_(α3) atoms. The result of this superimposition is displayed in FIG. 8. The lowest inter-cluster RMSD (0.59) in Table 3 is between cluster 2 and cluster 4. The result in FIG. 8 also demonstrates that cluster 2 (red) and cluster 4 (green) are most similar and, furthermore, provides a visual aid to understanding the meaning of an inter-cluster RMSD value of 0.59. The highest inter-cluster RMSD (2.38) exists between clusters 6 and 9 (Table 3). The result in FIG. 8 also demonstrates that cluster 6 (dark blue) and cluster 9 (grey) differ significantly.

The mean members of each cluster then become a query in a database searching strategy. This is described in more detail in Section 4.

2 Clustering of Loops of Proteins

2.1 Background

Loops are defined as any continuous amino acid sequence that joins secondary structural elements (helices and sheets). Consequently, loops are a superset of β-turns since there is no restriction on c_(α1)-c_(α4) distances (as described above). Loops often play an important function, as exemplified by their roles in ligand binding⁴⁰, DNA-binding⁴¹, binding to protein toxin⁴², forming enzyme active sites⁴³, binding of metal ions⁴⁴, binding of antigens by immunoglobulins⁴⁵, binding of mononucleotides⁴⁶ and binding of protein substrates by serine proteases⁴⁷. Identifying common loop motifs, then using these as queries in virtual screening or virtual library strategies, will provide a novel and powerful strategy for the design and synthesis of bioactive molecules.

2.2 Methods and Result

2.2.1 Extraction of Loops from Protein Data Bank

A database of loops was created by first extracting well refined (resolution of ≦2.0 Å and R-factor ≦20%) and non-homologous (≦25%) protein chains^(48,49) from the 1999 release of the Protein Data Bank³¹. The program STRIDE⁵⁰ was then used to identify secondary structural elements (helices and sheets) of these chains. The linking regions, defined as the remaining residues that link these secondary structural elements or the protein terminus, were used for further analysis. The linking regions that consisted of four or more amino acid residues were divided into four-residue segments, resulting in a total of 23650 four-residue loops. 319 of those loops were rejected because the distances between the backbone atoms n, c_(α) and c′ were not appropriate (≦0.8 or ≧2.0 Å). Each of the remaining 23331 loops was then simplified into four c_(α)-c_(β) vectors (FIG. 2 a). Our motivations for clustering using c_(α)-c_(β) vectors are several fold. The c_(α)-c_(β) vector describes the initiation of the side chain geometry, and is well defined experimentally as it is anchored to the backbone. This is in contrast to the more flexible penultimate atoms in the side chain. Furthermore, most mimetic strategies involve anchoring c_(α)-c_(β) bonds to a non-peptidic scaffold, the extra atoms of the side chain providing a degree of flexibility in molecular recognition. We therefore consider that clustering according to c_(α)-c_(β) vectors is functionally significant⁵¹, especially when the identified common motifs are used to direct peptidomimetic development.

2.2.2 Systematic Identification of Highly Populated Conformations(Seeds)

The present inventors have identified the appropriate seed points in order to cluster the loops of proteins. In this section the present inventors describe a process for identifying these seeds. By comparing each of the 23331 loops with all the other loops, all of the similar loops (“neighbors”) having an RMSD value of less than a constant, NEIGHBOR_LIMIT, were identified and counted for each loop. A plot of the number of neighboring loops versus its frequency (the number of times this number of neighbors was found) for various values of NEIGHBOR_LIMIT is shown in FIG. 10. For most values of NEIGHBOR_LIMIT, as the number of neighboring loops increases, the frequency decreases, and as the number of neighboring loops decreases, the frequency increases. However, with a large NEIGHBOR_LIMIT (0.6 and 0.7), the frequency maximum is located between 200 and 600 neighboring loops instead of near the lower numbers of neighboring loops. This means that for a large NEIGHBOR_LIMIT, it is more frequent to have 200 to 600 neighbors than 0 to 100 neighbors. The figure also shows the maximum number of neighboring loops for a particular NEIGHBOR_LIMIT, thereby giving an indication of the similarity between the loops. For example, the maximum number of neighboring loops is 1763 with a NEIGHBOR_LIMIT of 0.7, 526 with a NEIGHBOR_LIMIT of 0.4 and 146 with a NEIGHBOR_LIMIT of 0.25.

Now that the number of neighboring loops has been defined for each loop, the next step was to identify (given a specific loop and its neighbors) the loop that has the largest number of neighbors (a peak). That is, a loop is marked as a peak if all of its neighboring loops have a lower or equal number of neighboring loops. The number of peaks for various values of NEIGHBOR_LIMIT is shown in FIG. 11. The figure shows that the number of peaks varied from 7818 to 13 as the NEIGHBOR_LIMIT varied from 0.25 to 0.7. However, since peaks with fewer neighbors than an arbitrarily chosen low SEA_LEVEL of 20 are not interesting, as they are not significantly populated, the plot also shows the number of peaks with more than the SEA_LEVEL number of neighbors. The number of peaks with more than the SEA_LEVEL of 20 neighbors ranges from 56 to 3, decreasing with increasing NEIGHBOR_LIMIT.
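The neighbor counting and peak identification steps can be sketched in Python as follows. The function names are hypothetical and the `distance` argument is a placeholder: in the procedure described here it would be the RMSD between two loops after superimposition, which is not implemented in this sketch.

```python
def neighbor_lists(items, distance, neighbor_limit):
    """For each item, list the indices of all other items closer than the limit."""
    n = len(items)
    neighbors = [[] for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if distance(items[i], items[j]) < neighbor_limit:
                neighbors[i].append(j)
                neighbors[j].append(i)
    return neighbors

def find_peaks(neighbors, sea_level=20):
    """Indices of items that are peaks with more than `sea_level` neighbors.

    An item is a peak if none of its neighbors has more neighbors than it does.
    """
    counts = [len(nb) for nb in neighbors]
    peaks = []
    for i, nb in enumerate(neighbors):
        is_peak = all(counts[j] <= counts[i] for j in nb)
        if is_peak and counts[i] > sea_level:
            peaks.append(i)
    return peaks
```

The all-pairs comparison is quadratic in the number of loops, which matches the description of comparing each loop with all the other loops; for 23331 loops this is a few hundred million comparisons.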

Since some peaks can be quite similar to each other, filtering was performed to identify a set of unique peaks to represent the data. Two peaks are considered not sufficiently unique if the fraction of shared loops between the two peaks exceeds an OVERLAP_LIMIT value that was set to 20% of the total number of loops in the two peaks. Proceeding from the peaks with the largest number of neighbors to the peaks with the lowest number of neighbors, a peak was filtered out if it was not sufficiently unique when compared with all the previously chosen unique peaks. This definition, which is based on the fraction of overlap, is more discerning than a definition that is based on the RMSD between the average structures of the peaks. FIG. 11 also shows the number of unique peaks as a function of NEIGHBOR_LIMIT. For a NEIGHBOR_LIMIT of 0.4 and above, all the peaks were unique. However, for a NEIGHBOR_LIMIT of less than 0.4, not all the peaks were unique.
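A possible reading of this filtering step is sketched below in Python. The function name and the dictionary layout are hypothetical. One point is an explicit assumption: the "total number of loops in the two peaks" is taken here to be the size of the union of the two neighbor sets, which the text does not state unambiguously.

```python
def unique_peaks(peak_members, overlap_limit=0.20):
    """Greedy selection of unique peaks, most populated first.

    `peak_members` maps a peak identifier to the set of loops belonging to
    that peak (its neighbors). A peak is rejected if its overlap fraction
    with any previously chosen peak exceeds `overlap_limit`.
    """
    order = sorted(peak_members, key=lambda p: len(peak_members[p]), reverse=True)
    chosen = []
    for p in order:
        members = peak_members[p]
        is_unique = True
        for q in chosen:
            shared = len(members & peak_members[q])
            total = len(members | peak_members[q])  # assumption: union size
            if total and shared / total > overlap_limit:
                is_unique = False
                break
        if is_unique:
            chosen.append(p)
    return chosen
```

Because the comparison is made only against peaks already accepted, the outcome depends on the ordering by population, as in the procedure described above.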

As to the choice of the value for NEIGHBOR_LIMIT, the general principle is that the larger the NEIGHBOR_LIMIT value, the more the data is “generalized”. We seek to find the largest generalization without loss in the number of unique peaks with greater than 20 neighbors (SEA_LEVEL=20). FIG. 11 shows that the number of unique peaks with greater than twenty neighbors ranges from 3 to 40, decreasing with increasing NEIGHBOR_LIMIT value. When NEIGHBOR_LIMIT is increased from 0.25 to 0.3, FIG. 11 illustrates that there is no significant reduction in the number of unique peaks. This is in contrast to every other increase in NEIGHBOR_LIMIT. Consequently, a NEIGHBOR_LIMIT of 0.3 retains the number of unique peaks whilst increasing the “generalization” of the data.

To examine the implication of the choice of 0.3 for NEIGHBOR_LIMIT, we determined whether the unique peaks obtained using a higher NEIGHBOR_LIMIT are a subset of those obtained using the lower NEIGHBOR_LIMIT of 0.3. For example, if the unique peaks for the NEIGHBOR_LIMIT of 0.4 are a subset of (are similar to) the unique peaks found for the NEIGHBOR_LIMIT of 0.3, then we would expect to find structurally similar unique peaks in both datasets. To do this, we systematically paired each of the unique peaks obtained using a higher NEIGHBOR_LIMIT with the most structurally similar unique peak obtained using a NEIGHBOR_LIMIT of 0.3. The results of pairing the 39 unique peaks obtained by using a NEIGHBOR_LIMIT of 0.3 with all the unique peaks obtained by using the higher NEIGHBOR_LIMIT values of 0.4, 0.5, 0.6 and 0.7 are shown in Table 4. The RMSD between the pairs of unique peaks for all comparisons ranged from 0.0 to 0.88. There are a significant number of peaks between datasets that have an RMSD of less than 0.3. This illustrates that, in going to a higher NEIGHBOR_LIMIT, similar unique peaks are found when compared to the unique peaks identified with the NEIGHBOR_LIMIT of 0.3. This is particularly true for unique peaks with high frequency. Going down the rows of the table, the frequency of the unique peak decreases. As can be seen, for low frequency unique peaks the structural match between the datasets can sometimes be poor. We conclude that the choice of a NEIGHBOR_LIMIT of 0.3 is a reasonable compromise between “generalization” and accuracy (number of unique peaks), and the resulting unique motifs are used as seeds for further clustering.

2.2.3 Filtered Centroid Sorting Clustering

After the systematic identification of unique peaks with greater than twenty neighbors, the filtered nearest centroid sorting algorithm described in section 1.2.1.4 was utilized to refine the clustering. The 39 unique peaks with greater than twenty neighbors were used as initial seeds for our filtered nearest centroid sorting algorithm using various values of TOLERANCE. The percentage of data clustered, the average intracluster RMSD, the average intercluster RMSD and the ratio of the latter two are plotted as a function of the TOLERANCE in FIG. 12. The figure shows that as the TOLERANCE increases, the percentage of data clustered and the intra-to-inter cluster RMSD ratio both increase. At a TOLERANCE of 0.3, the percentage of data clustered, the average intracluster RMSD, the average intercluster RMSD and the ratio of the latter two are 12%, 0.22, 0.95 and 0.11, respectively; at a TOLERANCE of 0.6 they are 71%, 0.42, 1.94 and 0.21, respectively; and at a TOLERANCE of 0.9 they are 100%, 0.51, 1.90 and 0.27, respectively. Therefore, there are opposing forces in the choice of tolerance. A higher tolerance is favored because of the greater percentage of data clustered, but is disfavored because of the higher intra-to-inter cluster RMSD ratio. A lower tolerance is favored because of the lower intra-to-inter cluster RMSD ratio, but is disfavored because of the lower percentage of data clustered. From a mimetics perspective, it is more important to have a reasonable intracluster similarity (intracluster RMSD) than to have a high percentage clustered. To aid in the choice of intracluster RMSD, and indirectly the choice of TOLERANCE, a plot of all the loops in cluster one formulated using various tolerances is shown in FIG. 13. The figure shows that with an average intracluster RMSD of 0.47, which corresponds to a TOLERANCE of 0.7 and 89% of the data clustered, the members in the cluster are sufficiently similar for loop mimetic purposes. The number of loops in each of the 39 clusters obtained using a tolerance of 0.7 is plotted in FIG. 14. The least populated cluster has 307 loops and the most populated cluster has 1048 loops.

2.2.4 Vector Plots of Loops

The vector graphs, as described in 1.2.2.1, for all the 39 clusters (FIG. 15) show that the loops within a cluster have reasonably similar conformations.

2.2.5 Calculation of the RMSD Matrix of all the Clusters

An RMSD matrix is calculated to examine the performance of the clustering, by assessing the similarity within and dissimilarity between clusters (Section 1.2.2.3). The RMSD matrix, showing the average intra- and inter-cluster RMSD for all the 39 clusters, is shown in Table 5. The average intracluster RMSD of 0.47 and the vector graphs for all the clusters in FIG. 15 show that the loops within a cluster have similar conformations. The loops between clusters are dissimilar, as the average intercluster RMSD is 1.91.

2.2.6 Average Linkage Clustering Algorithm

Based on the above-mentioned RMSD matrix, we applied the average linkage clustering algorithm^(36,52,53) to determine the structural relationship between the 39 clusters. In the average linkage clustering algorithm^(36,52,53), each structure is initially assigned to its own cluster of size one. Subsequently, clusters are merged if the average distance between all the structures in the two clusters falls within some threshold. The resulting hierarchical tree, obtained by applying the average linkage clustering algorithm to the 39 clusters, is shown in FIG. 16. All the cluster numbers used in this paper follow the order from left to right of this hierarchical tree.

2.2.7 Distinct Clusters

Whilst each cluster contains a unique set of loops (no loops are in more than one cluster), do the 39 clusters represent overlapping variations from a continuous spread or do they represent distinct clusters that do not overlap in hyperspace? To answer this question, we defined two clusters as ‘distinct’ if the most frequent eighty percent of the data in one cluster does not overlap with the most frequent eighty percent of the data in the other cluster. The overlaps were computed for each of the following thirty-two descriptors of the loop conformation. Each c_(α)-c_(β) vector pair has four distances (c_(α1)-c_(α2), c_(α1)-c_(β2), c_(β1)-c_(α2), c_(β1)-c_(β2)), so the six possible c_(α)-c_(β) vector pairs (1-2, 1-3, 1-4, 2-3, 2-4, 3-4) of the four c_(α)-c_(β) vectors in a loop result in 24 distance descriptors. Furthermore, we also utilize all six possible torsional angle descriptors between the four c_(α)-c_(β) vectors and the two torsional descriptors c_(α1)-c_(α2)-c_(α3)-c_(α4) and c_(β1)-c_(β2)-c_(β3)-c_(β4). Distinctions were made based on each of those thirty-two descriptors.

The maximum and minimum values which delineate the most frequent eighty percent of each distribution were not computed using the mean plus and minus some standard deviation, because some of the spreads were not ‘Normal’ distributions. The maximum and minimum were computed by first ‘binning’ the data with respect to each of the thirty-two descriptors mentioned earlier. Then, the bins for each descriptor were sorted by frequency, from most frequent to least frequent. As the program traverses down the bins, the maximum and minimum of the descriptor are stored. The traversal is stopped when the program has traversed at least eighty percent of the data. Consequently, the stored maximum and minimum represent the values that delineate the top eighty percent of the distribution. This method of finding the maximum and minimum works well with single-peak distributions. For distributions with two or more peaks, the spread covered by the maximum and minimum is over-estimated. In such cases, the number of distinct clusters is under-estimated.

For a particular descriptor, if the maximum and minimum of cluster X overlap with those of cluster Y, then cluster X and cluster Y are considered to be overlapping. On the other hand, if the maximum and minimum of cluster X do not overlap with the maximum and minimum of cluster Y, then cluster X is considered to be distinct from cluster Y. This analysis shows that 737 out of a total of 741 combinations (99%) of clusters are distinct. Since the analysis was based on individual descriptors, the non-distinct clusters could be overlapping or distinct if the analysis were extended to include combinations of the descriptors.
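The bin-traversal procedure and the overlap test can be sketched as follows. This is an illustration under stated assumptions: the bin width is a free parameter not given in the text, the stored minimum and maximum are taken from the data values in the traversed bins, and the names `dense_range` and `distinct` are hypothetical.

```python
import math

def dense_range(values, bin_width, coverage=0.8):
    """Return (min, max) of the data lying in the most populated bins that
    together hold at least `coverage` of all values."""
    bins = {}
    for v in values:
        bins.setdefault(math.floor(v / bin_width), []).append(v)
    lo, hi, seen = float("inf"), float("-inf"), 0
    # Traverse bins from most frequent to least frequent.
    for members in sorted(bins.values(), key=len, reverse=True):
        lo, hi = min(lo, min(members)), max(hi, max(members))
        seen += len(members)
        if seen >= coverage * len(values):
            break
    return lo, hi

def ranges_overlap(r1, r2):
    return r1[0] <= r2[1] and r2[0] <= r1[1]

def distinct(cluster_x, cluster_y, bin_width):
    """Clusters are dicts mapping a descriptor name to its list of values.
    Two clusters are distinct if any one descriptor has non-overlapping
    dense ranges."""
    return any(
        not ranges_overlap(dense_range(cluster_x[k], bin_width),
                           dense_range(cluster_y[k], bin_width))
        for k in cluster_x
    )

# Hypothetical single-descriptor clusters (values in Angstrom).
x = {"d_ca1_ca2": [1.0, 1.1, 1.2, 1.1, 1.0]}
y = {"d_ca1_ca2": [3.0, 3.1, 3.2, 3.1, 3.0]}
print(distinct(x, y, bin_width=0.5))
```

With 39 clusters, applying `distinct` to every unordered pair gives the 741 combinations quoted above. For a two-peak distribution the traversal can jump between peaks, which widens the returned range exactly as the text cautions.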

These common protein loop motifs we have identified can now be used in database searching strategies to identify molecules that match the shape of these motifs. This is described in more detail in section 5.

3 The Clustering of Protein Contact Surfaces

3.1 Background

The protein contact surfaces are comprised of continuous sequences of amino acid residues as well as discontinuous sequences of amino acid residues. In the previous two sections, we have clustered the continuous loops of protein binding sites. In this section, we describe the clustering of the side chains of protein contact surfaces.

3.2 Method

3.2.1 Definition of residues in protein-protein interfaces

At least four criteria have been used in the literature to define residues that are involved in protein-protein interfaces. Two residues in two different chains are considered to be in the protein-protein interface if (1) their c_(α) atoms are less than 9.0 Å apart, or (2) any atom in one residue is within 5 Å of any atom of the other residue, or (3) the distance between any atom of one residue and any atom of the other residue is less than the sum of their corresponding van der Waals radii plus 0.5 Å, or (4) the van der Waals energy between the residues is less than −0.5 kcal/mol. The results are quite uniform between the four criteria⁵⁴. Criterion three is used here. Furthermore, when the number of residues in an interface is less than 10, the interface is rejected because there is a high probability that this protein-protein interaction is a result of crystal packing.

3.2.2 Non-Redundant Dataset

Tsai et al.⁵⁴ scanned 2814 PDB entries from the September 1994 release of the PDB database³¹ and found 1629 two-chain interfaces. Out of the 1629 two-chain interfaces, Tsai et al. extracted 351 non-redundant families through the use of a structural comparison algorithm, a measure of similarity and clustering of the structures into families (http://protein3d.ncifcrf.gov/tsai/frame/dataset.html).

A C++ program, p-p_interface.cpp, was written to extract the coordinates of the c_(α)-c_(β) vectors of the residues in the protein-protein interface. For glycine residues, the coordinates of the c_(β) atom were obtained by superimposing the n, c_(α) and c′ atoms of an ideal alanine onto the n, c_(α) and c′ atoms of the glycine. Non-glycine residues without c_(β) atom coordinates were not included in the dataset. The 350 non-redundant families gave rise to 700 protein chains, each with up to 150 residues that form contacts in a single chain. This results in a very large dataset. For example, to identify common motifs from groups of 3 residues (the smallest size we would consider from a mimetic perspective) there would be as many as $700 \times \binom{150}{3} = 385{,}910{,}000$ residue comparisons. Most clustering algorithms are too computationally expensive for such a large problem. Consequently, we have developed new approaches to tackle the clustering of protein contact surfaces.
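The size estimate quoted above can be checked with a short calculation. The variable names are illustrative only; the figures (700 chains, at most 150 contact residues, groups of 3) are those given in the text.

```python
from math import comb

chains = 700                 # protein chains from the non-redundant families
max_contact_residues = 150   # upper bound on contact residues per chain
group_size = 3               # smallest motif size considered

triples_per_chain = comb(max_contact_residues, group_size)
total_triples = chains * triples_per_chain
print(triples_per_chain, total_triples)
```

The count is an upper bound, since most chains contribute fewer than 150 contact residues.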

3.2.3 Identification of Seeds for Clustering.

Identifying common structures in protein contact surfaces is a significant challenge. The first step in determining the matching frequency of a group of residues is to develop a method to compare them. The large size of the database excludes the root mean squared deviation (RMSD) algorithm, which is computationally expensive. A simpler method commonly used in chemical and biological situations is distance geometry comparison.

The simplicity of this distance geometry method allows for rapid geometrical comparisons. However, distance geometry comparisons do not return a value of how closely related two geometries are (like an RMSD value) but instead return a match if their distances are the same within a certain tolerance. The geometric relationship between two residues can be represented by four “bowtie” distances (FIG. 17). The tolerance, TOL, represents the maximum allowed difference between these distances to record a match. So two groups of residues {A,B} and {C,D} (FIG. 17) are matched within TOL if and only if

$$
\begin{aligned}
|d(A_H,B_H)-d(C_H,D_H)| &\le \mathrm{TOL},\\
|d(A_H,B_T)-d(C_H,D_T)| &\le \mathrm{TOL},\\
|d(A_T,B_H)-d(C_T,D_H)| &\le \mathrm{TOL},\\
|d(A_T,B_T)-d(C_T,D_T)| &\le \mathrm{TOL},
\end{aligned}
$$

where d(x,y) is the Euclidean distance between the points x and y in three-dimensional space. Typically TOL ∈ [0.2, 1.0] Å.
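The four-distance test can be written directly from the conditions above. In this minimal sketch each residue is assumed to be a pair of 3-D points (head H, tail T), standing for the two ends of a c_(α)-c_(β) vector; which end is labelled H or T in FIG. 17 is not stated here, and the function names are hypothetical.

```python
from math import dist

def bowtie(res_a, res_b):
    """Four bowtie distances between two residues, each given as (H, T)."""
    (ah, at), (bh, bt) = res_a, res_b
    return (dist(ah, bh), dist(ah, bt), dist(at, bh), dist(at, bt))

def pair_matches(a, b, c, d, tol):
    """True if {A,B} and {C,D} agree on all four bowtie distances within tol."""
    return all(abs(x - y) <= tol for x, y in zip(bowtie(a, b), bowtie(c, d)))

# Hypothetical coordinates in Angstrom: {C,D} is {A,B} translated by 10 along x.
A = ((0.0, 0.0, 0.0), (0.0, 0.0, 1.5))
B = ((4.0, 0.0, 0.0), (4.0, 0.0, 1.5))
C = ((10.0, 0.0, 0.0), (10.0, 0.0, 1.5))
D = ((14.0, 0.0, 0.0), (14.0, 0.0, 1.5))
D_tilted = ((14.0, 0.0, 0.0), (14.0, 0.0, 3.0))

print(pair_matches(A, B, C, D, tol=0.2))
print(pair_matches(A, B, C, D_tilted, tol=0.2))
```

Because only interatomic distances are compared, no superposition is needed, which is the source of the speed advantage over RMSD noted above.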

The extension to groups of more than 2 residues is simple. The general rule is that a bowtie must be formed between each two-residue combination within the group. A group of size N contains exactly $\binom{N}{2}$ bowties. To check whether two groups of N residues are matched, every possible rotation ($\binom{N}{2}$ in all) of the groups (or every bowtie-bowtie comparison) can be considered. For example, two groups of 3 residues {A,B,C} and {D,E,F} contain 6 possible residue matchups

So these two 3-motifs are matched if

where {A,B}*+{D,E} denotes that motifs {A,B} and {D,E} are equal within tolerance.

3.2.4 Extracting Matching Frequency of Motifs

To extract the most common motifs from the dataset, each motif is compared against all others. The set of all motifs that match a particular motif, within a tolerance called the family tolerance (TOL), is referred to as the family of that motif. The cardinality of that family is the matching frequency for that motif. The common motifs with high matching frequencies are those that make good candidates for seed points.

Pseudo-code for the generic algorithm for determining the matching frequency for each motif is given in FIG. 18. This algorithm generates all motifs and their bowtie distances and then exhaustively compares each against all others in the data set.
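FIG. 18 is not reproduced here, so the following is only a sketch of an exhaustive all-against-all comparison consistent with the description, not the pseudo-code of the figure. It assumes each motif is a list of residues given as (H, T) point pairs, tries every one-to-one residue pairing when comparing two motifs, and excludes a motif from its own family; all names are hypothetical.

```python
from itertools import combinations, permutations
from math import dist

def bowtie(r, s):
    (rh, rt), (sh, st) = r, s
    return (dist(rh, sh), dist(rh, st), dist(rt, sh), dist(rt, st))

def motif_matches(m1, m2, tol):
    """True if some pairing of residues makes every bowtie agree within tol."""
    n = len(m1)
    pairs = list(combinations(range(n), 2))
    ref = [bowtie(m1[i], m1[j]) for i, j in pairs]
    for perm in permutations(range(n)):
        cand = [bowtie(m2[perm[i]], m2[perm[j]]) for i, j in pairs]
        if all(abs(x - y) <= tol
               for b1, b2 in zip(ref, cand) for x, y in zip(b1, b2)):
            return True
    return False

def families(motifs, tol):
    """Map each motif index to the indices of the motifs that match it.
    The matching frequency of motif k is len(families(...)[k])."""
    fam = {k: [] for k in range(len(motifs))}
    for i, j in combinations(range(len(motifs)), 2):
        if motif_matches(motifs[i], motifs[j], tol):
            fam[i].append(j)
            fam[j].append(i)
    return fam

def shift(motif, v):
    return [tuple(tuple(c + d for c, d in zip(p, v)) for p in res) for res in motif]

# Hypothetical 3-motifs: m1 is m0 translated and reordered, m2 is a larger shape.
m0 = [((0, 0, 0), (0, 0, 1.5)), ((4, 0, 0), (4, 1.5, 0)), ((0, 5, 0), (1.5, 5, 0))]
s = shift(m0, (10, 10, 10))
m1 = [s[2], s[0], s[1]]
m2 = [((0, 0, 0), (0, 0, 1.5)), ((8, 0, 0), (8, 1.5, 0)), ((0, 9, 0), (1.5, 9, 0))]
fam = families([m0, m1, m2], tol=0.5)
print(fam)
```

The cost grows with the square of the number of motifs and with the number of residue pairings per comparison, which is why the data set is pruned before higher-order motifs are built.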

Initially, 3-motifs were formed by simply considering every feasible combination of 3 residues within each chain. However, the set of feasible motifs excludes those that contain any bowtie distance greater than 25 Å, as we are only interested in common surface patches of this size. Although a very large data set is produced, it is still manageable for this algorithm. The data set produced for 4-motifs formed this way is too large. Using information from 3-motifs, the size of the data set for 4-motifs can be greatly reduced. Every 4-motif is formed by four 3-motifs. For example, the group {A,B,C,D} is formed by the 4 groups {A,B,C}, {A,B,D}, {A,C,D} and {B,C,D}. If any of these 3-motifs does not occur frequently in the database, then there is no chance of the motif {A,B,C,D} being a highly common motif either. This greatly reduces the size of the data set and is essential for the construction of higher order groups. The same method is used for higher order motifs.
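The build-up rule can be illustrated with residue labels alone. In this sketch, which is an assumption-laden simplification, a motif is a frozenset of residue identifiers within one chain and `frequent` holds the k-motifs that survived the frequency cut-off; a (k+1)-motif is kept as a candidate only if every one of its k-residue subsets is in `frequent`. The function name is hypothetical.

```python
from itertools import combinations

def candidate_supersets(frequent, residues):
    """Return the (k+1)-motifs all of whose k-subsets are frequent k-motifs."""
    k = len(next(iter(frequent)))
    out = set()
    for motif in frequent:
        for r in residues:
            if r in motif:
                continue
            cand = motif | {r}
            if all(frozenset(sub) in frequent for sub in combinations(cand, k)):
                out.add(cand)
    return out

# Hypothetical frequent 3-motifs among residues A..E of one chain.
frequent_3 = {frozenset(m) for m in ("ABC", "ABD", "ACD", "BCD", "ABE")}
candidates_4 = candidate_supersets(frequent_3, "ABCDE")
print(candidates_4)
```

Here only {A,B,C,D} survives: {A,B,C,E}, for instance, is rejected because {A,C,E} is not a frequent 3-motif. The surviving candidates would still need their own matching frequencies computed.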

3.2.5 Finding the Peak Matching Frequencies

Now that the matching frequency has been determined for each motif, the next step is to identify the motifs with ‘peak’ matching frequency. A motif is marked as a peak if all the other motifs within the family of the motif have lower or equal matching frequency. The algorithm for searching for peaks in the data set is given in FIG. 19. Initially the algorithm tags every motif as a peak. Subsequently, for every motif A in the data set, any motif within the family of motif A that has a lower matching frequency than A is tagged as non-peak.
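The tagging procedure can be sketched as below. FIG. 19 is not reproduced here, so this follows only the prose description; it assumes `families` maps each motif identifier to the identifiers of the motifs in its family, with the matching frequency taken as the family size. The function name is hypothetical.

```python
def find_peaks(families):
    """Return the motifs that no family member out-ranks in matching frequency."""
    freq = {m: len(members) for m, members in families.items()}
    is_peak = {m: True for m in families}      # every motif starts as a peak
    for a, members in families.items():
        for b in members:
            if freq[b] < freq[a]:              # b sits lower on a's hill
                is_peak[b] = False
    return [m for m, flag in is_peak.items() if flag]

# Hypothetical families: motif 0 matches 1 and 2; motif 3 matches nothing.
toy_families = {0: [1, 2], 1: [0], 2: [0], 3: []}
peaks = find_peaks(toy_families)
print(peaks)
```

Motifs with equal frequency in the same family both remain tagged as peaks under this rule, so a later step selecting unique seed motifs is still needed.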

3.2.6 Algorithms for Clustering Motifs

The objective of clustering methods, in this section, is to retrieve all related motifs about some peak motifs, with the proviso that not all motifs in the data set need be clustered.

The simplest method for clustering motifs is a reverse scan of the data set, as for the original matching algorithm. This method passes through the data set once, accepting into the cluster every motif that is matched. This procedure is very similar to the PAM non-hierarchical method⁵⁵. This algorithm is called the one-pass algorithm and is outlined in FIG. 20. The span of the one-pass algorithm for a landscape of motifs is illustrated in FIG. 21. The algorithm assumes that all hills are symmetric about their peak motif. The tolerance, OTOL, assumes that the width of each hill is identical. The entire hill is rarely collected because the range of the tolerance is constant throughout the algorithm. There is also a possibility that some motifs collected belong to a different hill.
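A one-pass scan of this kind can be sketched as follows. The comparison is passed in as a function so that the bowtie test (or any other) can be used; for brevity the toy example stands in for motifs with points on a line, which is an assumption made purely for illustration. FIG. 20 is not reproduced here and the names are hypothetical.

```python
def one_pass(seeds, motifs, matches, otol):
    """For each seed, collect in a single scan every motif matching it within otol."""
    return {
        seed: [seed] + [m for m in motifs if m != seed and matches(seed, m, otol)]
        for seed in seeds
    }

def close(a, b, tol):
    # Toy stand-in for a geometric motif comparison.
    return abs(a - b) <= tol

toy_motifs = [0.0, 0.2, 0.4, 0.9, 1.0, 3.0]
one_pass_clusters = one_pass([0.0, 1.0], toy_motifs, close, otol=0.5)
print(one_pass_clusters)
```

Because each seed is scanned independently with the same OTOL, a motif lying between two seeds can be collected by both, and an outlying motif (3.0 here) is simply left unclustered.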

An algorithm that in part overcomes these difficulties is the greedy algorithm (FIG. 22). This algorithm is very similar to the single linkage hierarchical method. Each cluster is initialised as a seed point. Motifs are added to each cluster if they match any motif within the current cluster within tolerance. The algorithm moves down each peak until no other motifs in the data set match any motif within the cluster. This span is illustrated in FIG. 23. A flaw of the greedy algorithm is that it continues to collect motifs until no others exist that match those in the cluster (within GTOL). There is a danger of collecting motifs that belong to another peak (if GTOL is too large) or not collecting enough (if GTOL is too small). There is also no guarantee that the distribution of motifs down each hill is consistent or even. There is also a possibility that the algorithm may not halt until every motif in the database is collected if GTOL is overly large.

A simple method of overcoming this problem is to apply an additional tolerance to the greedy algorithm. The combined one-pass and greedy algorithm applies a one-pass tolerance, OTOL, to the greedy algorithm to limit its span. It applies the additional constraint that every motif in each cluster has to be within OTOL of the seed motif. This algorithm is outlined in FIG. 24.
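The greedy growth and its OTOL-limited variant can be sketched with one function. As before, this follows only the prose (FIGS. 22 and 24 are not reproduced), the comparison is supplied by the caller, and the one-dimensional toy data and all names are assumptions made for illustration.

```python
def greedy(seed, motifs, matches, gtol, otol=None):
    """Grow a cluster from `seed`: add any motif matching a current member
    within gtol. If otol is given, a motif must also match the seed within otol."""
    cluster = [seed]
    remaining = [m for m in motifs if m != seed]
    grew = True
    while grew:                      # repeat until a full pass adds nothing
        grew = False
        for m in list(remaining):
            if otol is not None and not matches(seed, m, otol):
                continue
            if any(matches(c, m, gtol) for c in cluster):
                cluster.append(m)
                remaining.remove(m)
                grew = True
    return cluster

def close(a, b, tol):
    # Toy stand-in for a geometric motif comparison.
    return abs(a - b) <= tol

chain = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0, 3.0]
unbounded = greedy(0.0, chain, close, gtol=0.25)
bounded = greedy(0.0, chain, close, gtol=0.25, otol=0.5)
print(unbounded, bounded)
```

The unbounded run walks along the whole chain of closely spaced motifs, which is the runaway behaviour described above, while the OTOL constraint stops the cluster at the motifs that still resemble the seed.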

A tolerance, ‘sealevel’, with respect to the matching frequency (rather than the geometry) of a motif is also effective in restricting the span of the greedy algorithm. All motifs with matching frequency below the sealevel are discarded from the data set. FIG. 25 shows how a sealevel is applied to the single linkage algorithm. An illustration of the possible span of the algorithm is given in FIG. 26. However, selecting the correct sealevel is not always easy. If set too low, it will have little effect in restricting the span of the algorithm. If set too high, some peaks (and hills) will be excluded from the data set altogether. As an illustration, the first peak from the left in FIG. 26 has been totally excluded from the data set and the third peak has been restricted more so than the second. Unless all peaks are about the same height, the application of a sealevel will handicap some hills more than others.

To overcome this problem, the sealevel for each peak can be set adaptively, as shown in FIG. 27. This is a more appropriate method for restricting the span of the algorithm. Each sealevel can be scaled according to the frequency of the peak motif.
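An adaptive sealevel amounts to a per-seed frequency filter applied before (or during) the greedy growth. The sketch below assumes the sealevel is a fixed fraction of the seed's matching frequency; the function name and the toy frequencies are hypothetical.

```python
def above_sealevel(motifs, freq, seed, fraction):
    """Keep the motifs whose matching frequency is at least
    fraction * (matching frequency of the seed)."""
    level = fraction * freq[seed]
    return [m for m in motifs if freq[m] >= level]

# Hypothetical matching frequencies for two peaks (p, q) and two hill motifs.
freq = {"p": 400, "a": 120, "b": 40, "q": 80}
names = list(freq)
kept_for_p = above_sealevel(names, freq, "p", fraction=0.125)
kept_for_q = above_sealevel(names, freq, "q", fraction=0.125)
print(kept_for_p, kept_for_q)
```

With a single fixed sealevel of 50, the lower peak q would lose part of its hill; scaling the level to each seed keeps the low-frequency motif b available to the q cluster while still excluding it from the p cluster.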

All methods discussed in this section can be altered for superimposition of motifs onto the seed motif. FIG. 28 shows how this adjustment is made to the PAM method. Although this new tolerance minimised the RMSD value for each

the peak for each seed. These algorithms are slightly more expensivethan their distance geometry counterparts.

3.2.7 Secondary Structure Analysis

The analysis of the secondary structure of each residue in the motifs was conducted using the DSSP (Dictionary of Protein Secondary Structure) software⁴. This software has been utilized widely throughout the literature for the classification of protein shapes into formal secondary structure. Given the 3D coordinates of residues within a protein, DSSP classifies the shape based on its refined expert system. There are 8 different secondary structure classifications considered by DSSP. They are described in Table 14 together with the abbreviations that will be adopted in this paper. The classification ‘no assignment’ refers to shapes that do not fit any of the other classifications defined.

3.3 Clustering Results

3.3.1 Determining the Seed Points

The first step to determine the seed points for clustering was to calculate the matching frequency of each 3-motif in the dataset. This was the highest order of motif that was computationally feasible. An exhaustive combination of 3-motifs produced 9,215,424 motifs as opposed to 197,712,949 4-motifs.

Three family tolerances (TOL) of 0.25, 0.5 and 0.75 Å were chosen based on a number of sample calculations. 0.25 Å was the lowest tolerance that produced meaningful generalisations about structure in the dataset, while 0.75 Å was the highest tolerance that was computationally feasible, especially as the number of residues to be clustered increased.

Following the calculation of the matching frequencies for 3-motifs for all three TOL, the 4-motifs were constructed. As discussed previously in the method section, these motifs were constructed based on the common 3-motifs. At this point, we eliminated the uncommon motifs by removing all motifs that have a matching frequency below the matching frequency sealevel. For example, all 3-motifs with a matching frequency of less than 30 were excluded before 4-motifs were created. This level was selected based on trends seen in the dataset. A matching frequency sealevel of 30 is insignificant when compared against the highest matching frequency (434 for TOL 0.25 Å). We also tested a matching frequency sealevel of 20 when forming 4-motifs from the 3-motifs, and found that lowering the matching frequency sealevel had no effect on the frequency of the seed points. Given the sealevels of 30 for 3-motifs, 5 for 4-motifs and 0 for higher order motifs, the highest matching frequency for each tolerance for each N-motif (3≦N≦7) is given in FIG. 30. FIG. 29 shows the number of N-motifs created. The number of motifs for larger tolerances is greater because the matching frequencies are higher and hence less of the dataset is excluded.

Following the calculation of the matching frequency for all motifs within the dataset, peak motif geometries can be extracted from the dataset (FIG. 19). From the dataset of peak motifs, the 30 motifs with the largest matching frequency were selected to be the seed points for the clustering stage. This value is kept constant for all motif sizes and tolerances. Although chosen arbitrarily, it almost always includes all unique motifs that have up to half the matching frequency of the highest value for that dataset. At the same time, there are not so many seed points as to cause significant overlapping in the clusters. When only 30 seed points are selected, this overlapping only occurs when large tolerances are adopted for the clustering algorithms. In addition, a histogram of the dataset reveals that a significant amount of the original data is covered by the 30 most common unique motifs.

3.3.2 The Clusters

After the determination of the seed motifs, two clustering methods were finally adopted: (1) the one-pass algorithm (FIG. 20), using the same tolerance as the initial family tolerance and with no sealevel applied; and (2) the greedy algorithm with an adaptive sealevel proportional to the peak (or seed) matching frequency (FIG. 27). A number of different greedy tolerances (GTOL) were applied in an attempt to achieve as large a range as possible for the tightness of the clusters. Different adaptive sealevels of 0.125, 0.25, 0.5 and 0.75 of the frequency of the seed were trialled during clustering.

The success of each clustering algorithm is determined by three pieces of information: the size of the clusters, the intracluster RMSD and the intercluster RMSD. The aim is to cluster as many motifs as possible, while the resulting clusters contain motifs that are similar (minimise intracluster RMSD) and each cluster differs as much as possible from the others (maximise intercluster RMSD).

Summary tables for the 4-motifs, 5-motifs, 6-motifs and 7-motifs are given in Tables 6, 8, 10 and 12, respectively. The three different family tolerances considered, 0.25, 0.5 and 0.75 Å, are given in the second column of each summary table. Information about the clustering for family tolerance 0.25 Å is not presented for 6-motifs or larger because the dataset was too small to extract meaningful clusters. The summary tables give the sum of the sizes of the clusters and the sum of the number of unique motifs within those clusters. The difference between the sum of the sizes of the clusters and the sum of the unique motifs in clusters gives the number of motifs that occur in more than one cluster. The intracluster and intercluster RMSD are the averages for each algorithm.

Tables 7, 9, 11 and 13 contain the intracluster RMSD, the average RMSD of each motif in the cluster superimposed onto the average motif for that cluster (along the main diagonal), and the intercluster RMSD, the average RMSD of all motifs against the mean motif of the other clusters (entries off the main diagonal). At the top of the tables, the size of each cluster is presented.

The representative clusters for each N-motif are selected based on a simple criterion: select the clustering technique producing the largest amount of data, while having a small intracluster RMSD and a large intercluster RMSD. A small intracluster RMSD is less than 0.5 Å and a large intercluster RMSD is greater than 2.0 Å. These values were determined based on a visualisation of the resulting clusters. In addition, these clusters should be relatively distinct, that is, not too many motifs should occur in more than one cluster.

The summary of results for the clustering of 4-motifs is given in Table 6. The parameters producing the largest set of clusters were the greedy method, family tolerance 0.75 Å, algorithm tolerance 0.5 Å and sealevel 0.125 of the seed matching frequency. However, the average intracluster RMSD of 0.67 Å for this method is far too high. The next largest span of the dataset was achieved by the same method with sealevel 0.25 of the seed matching frequency. This method produced clusters with an average intracluster RMSD of 0.51, which is much more acceptable. So the clusters produced by this method were selected as representative of the dataset for 4-motifs. The specific information about the selected 4-motif clusters is given in Table 7. The RMSD values on the main diagonal are much smaller than other values in the table. As described previously in Section 2.2.3, we anticipate using the filtered-centroid sorting algorithm to further refine the clusters identified. A picture of one of these clusters (C29) is seen in FIG. 31.

The summary of results for 5-motifs is given in Table 8. The selected representative clusters in this case were produced by the greedy algorithm, family tolerance 0.75 Å, greedy tolerance 0.7 Å and sealevel 0.125 of the seed matching frequency. The specific information about the selected clusters for 5-residue motifs is given in Table 9. An example of a cluster of 5-residue motifs (C30) is given in FIG. 32.

Average clustering results for 6-motifs are presented in Table 10. The selected method in this case was the greedy algorithm, family tolerance 0.75 Å, greedy tolerance 0.7 Å and sealevel 0.125 times the peak matching frequency. Specific cluster information for this method is presented in Table 11. A representative cluster (C1) for the selected clustering algorithm is given in FIG. 33.

The summary of results for 7-motifs is given in Table 12. The selected representative clusters in this case were produced by the greedy algorithm, family tolerance 0.75 Å, greedy tolerance 0.9 Å and sealevel 0.125 of the seed matching frequency. Information about the specific clusters is presented in Table 13. An example of a visualised cluster of 7-motifs (C10) is given in FIG. 34.

3.3.3 Secondary Structure of the Clusters

An analysis of the secondary structure of the seed of each cluster was undertaken as described in the method section. The results show that all the residues of each seed were classified as α-helix, except for the seed of cluster C2 of the 4-motifs, where the four residues were classified as extended β-strand. There was a possibility, however, that this secondary structure classification of the seeds may not agree with the classification of the average motif of each cluster. Table 15, Table 16, Table 17 and Table 18 give the distribution of secondary structure throughout each of the 4-, 5-, 6- and 7-residue motif clusters, respectively. The results in these tables confirm that the secondary structures of the seeds are almost always consistent with the secondary structures of the motifs within the cluster. The only possible exception to this is cluster C2 for the 4-motifs, where only 56% of motifs share secondary structure with the seed. Most of the others (all α-helical) have over 90% agreement with the seed, for all sizes of motifs.

Even if all the residues in the seeds or clusters are α-helical, the seeds or clusters do not necessarily belong to a single α-helix, because the residues lying between the residues in the motif may not be α-helical. Due to possible uncertainty with the DSSP classification, an α-helix is considered broken when flanked by two or more consecutive non α-helical residues. Table 19 records the proportion of motifs in each cluster that are not a single α-helix. Except for cluster C2 of the 4-motifs (of which 99% are non-helical), almost all other motifs are single α-helix.

3.3.4 Non Single α-Helical Clusters

Given that almost all the above-mentioned clusters are part of a single α-helix, we proceeded with extracting non-single α-helix clusters. The first step was to find the highest matching frequency seeds that were not a single α-helix. The secondary structure of these seeds is given in Table 20. The results show that as the size of the motif increases, the secondary structure within each motif becomes more uniform.

Table 21 contains a summary of the clusters retrieved using a variety of methods with the new seeds. The most successful clustering methods of the previous analysis were adopted as starting points for this analysis. These results show that there is a strong trend towards clusters becoming more distinct as the size of the motif increases. For 4-residue motifs, there is significant sharing of motifs between clusters, with many having quite a large intracluster RMSD. However, as the size of the motifs becomes larger, significantly larger tolerances can be adopted without altering the composition of the resulting clusters. The greedy tolerance could be extended to 1.1 Å for 6-residue motifs with the intracluster RMSD remaining very low. This is in contrast to the highest tolerance of 0.7 Å that could be adopted for the original set of seeds.

The criterion for the selection of representative clusters for each size of motif was the same as for the previous clustering study: to select the largest clusters possible, while keeping the intracluster RMSD low (less than approximately 0.5 Å) and the intercluster RMSD high (greater than 2.0 Å). RMSD comparisons between the resulting clusters for 4-residue motifs are presented in Table 22.

Table 23 presents the distribution of different types of secondary structure for the new 4-residue motif clusters. Even though the seeds of each of these clusters are classified as ‘not a single α-helix’, a large proportion of motifs within three clusters are α-helix. Table 24 presents the proportion of motifs in each of the new clusters that are not a single α-helix. Less than 20% of motifs of clusters C5, C17 and C25 are not a single α-helix and more than 90% of residues of these clusters are classified as α-helix. Clusters that have a high proportion of motifs with no assignment to formal secondary structure, such as C2, C3, C4, C7, C10, C12 and C16, have relatively small size (the largest has just 26 members) and relatively low intracluster RMSD. Within this group, the average intracluster RMSD is 0.28 Å as opposed to the average intracluster RMSD of 0.52 Å for the entire set.

This relationship remains consistent for larger sizes of motifs too. Table 25 presents the distribution of secondary structure classifications for 7-residue motifs. Only clusters C1, C2 and C3 have motifs whose component residues are not entirely classified ‘no assignment’. These three clusters have an average intracluster RMSD of 0.41 Å, which is much larger than the average intracluster RMSD for the entire set of 0.26 Å. This correlation suggests that the clusters containing motifs with no formal secondary structure assignment have shapes that are highly distinct from any other motifs in the dataset.

4 Clustering of Surface Patches

The basic algorithm for the clustering of surface patches is similar to that for clustering discontinuous protein surfaces. Firstly, snapshots of the protein surface, called patch motifs, are generated. A patch motif containing N grid points is referred to as an N-patch. The smallest patch size that will be considered in this study is the 3-patch. The algorithm for the construction of 3-patches is given in FIG. 36.

FIG. 35 describes the algorithm for determining the matching frequency of each patch motif. Again, an RMSD superimposition of these motifs is too computationally expensive because of the size and the number of patches. A distance matrix is constructed so that the shape of the patches can be easily compared. This distance matrix is calculated for an N-patch by creating a complete graph K_(N) whose vertices are the grid points and whose edges are weighted by the Euclidean distance between each pair of vertices. An illustration of a complete graph K₉ constructed for a 9-patch is given in FIG. 38.

A total of N! comparisons are required to determine if two N-patches are equal in shape. Every vertex-vertex match-up needs to be considered. This is a very expensive calculation computationally. If N=4, for example, 24 orientations with 144 edge distance comparisons need to be attempted in order to determine that two patches do not have matching geometric structure. If N=9, the number of orientations becomes 362,880 with 13,063,680 distance comparisons! A number of improvements need to be made to make this problem feasible.
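The quoted counts follow from N! vertex orderings, each requiring one comparison per edge of the complete graph K_(N). A short check, with an illustrative function name:

```python
from math import comb, factorial

def comparison_cost(n):
    """Return (orientations, edge-distance comparisons) for two n-patches
    when every vertex-vertex match-up is tried."""
    orientations = factorial(n)
    edges = comb(n, 2)          # edges of the complete graph on n vertices
    return orientations, orientations * edges

print(comparison_cost(4))
print(comparison_cost(9))
```

The factorial growth, rather than the per-orientation edge count, dominates the cost, which motivates the cheap rejection tests described next.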

A number of quick and simple comparisons can be made between pairs of patches before an exhaustive check takes place. This is to improve the computational feasibility of the problem. The first is to compare the charges of each grid point before comparing distances. If the charges do not match within the charge tolerance, there is no need to check if the edge distances match. This is a much easier calculation. Another is to check if the longest and shortest edge distances of the pair of patches match. If they do not, then the geometric structure of the patches is different. Finally, there may be some distances of the distance matrix that remain constant for all patches because of the way the grid was originally constructed. This may remove the need to consider certain orientations of patches when being compared.
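The first two quick tests can be placed in front of the exhaustive check as sketched below. The patch representation (a list of grid points, each a coordinate triple with a charge), the two tolerances and the function name are assumptions made for illustration; the grid-specific shortcut mentioned last is not attempted.

```python
from itertools import combinations, permutations
from math import dist

def patches_match(p1, p2, dist_tol, charge_tol):
    """Each patch is a list of (xyz, charge) grid points, with at least 2 points."""
    n = len(p1)
    if n != len(p2):
        return False
    # Quick test 1: charges, compared in sorted order, must agree.
    q1, q2 = sorted(q for _, q in p1), sorted(q for _, q in p2)
    if any(abs(a - b) > charge_tol for a, b in zip(q1, q2)):
        return False
    # Quick test 2: longest and shortest edge distances must agree.
    e1 = [dist(a[0], b[0]) for a, b in combinations(p1, 2)]
    e2 = [dist(a[0], b[0]) for a, b in combinations(p2, 2)]
    if abs(max(e1) - max(e2)) > dist_tol or abs(min(e1) - min(e2)) > dist_tol:
        return False
    # Exhaustive check over every vertex-vertex match-up.
    pairs = list(combinations(range(n), 2))
    for perm in permutations(range(n)):
        if any(abs(p1[i][1] - p2[perm[i]][1]) > charge_tol for i in range(n)):
            continue
        if all(abs(dist(p1[i][0], p1[j][0])
                   - dist(p2[perm[i]][0], p2[perm[j]][0])) <= dist_tol
               for i, j in pairs):
            return True
    return False

# Hypothetical 3-patches: b is a translated, reordered copy of a.
a = [((0, 0, 0), 0.1), ((1, 0, 0), -0.2), ((0, 1, 0), 0.3)]
b = [((5, 6, 0), 0.3), ((5, 5, 0), 0.1), ((6, 5, 0), -0.2)]
c = [((0, 0, 0), 0.1), ((1, 0, 0), -0.2), ((0, 1, 0), 0.6)]   # charge differs
d = [((0, 0, 0), 0.1), ((1, 0, 0), -0.2), ((0, 3, 0), 0.3)]   # stretched
print(patches_match(a, b, 0.05, 0.05))
```

Both quick tests are necessary conditions for a match, so they can only reject pairs early and never change the final answer.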

The (N+1)-patches will be constructed based on N-patches in a similar manner to the higher order motif build-up procedure for the discontinuous surfaces described earlier. Patches whose matching frequency is less than a defined sealevel will be removed from the dataset. This reduces the number of redundant higher order patches that will be created. The algorithm for the creation of higher order patches is given in FIG. 37. In this algorithm, each new N-patch is created from (N-1) (N-1)-patches.

Once the matching frequency is determined, seeds and clusters can be obtained as described in the previous sections.

5. Scaffolds

As described above, the present inventors have clustered the side chain positions of β-turns, loops and protein contact surfaces. This has resulted in the identification of 9, 39 and 240 highly populated motifs for β-turns, loops and protein contact surfaces, respectively. As an example, the coordinates of the 5^(th) least popular cluster of all these motifs are given in Table 26. These motifs define common spatial elements of protein surfaces. Our objective is to use these motifs to design libraries of molecules. Consequently, these motifs are used as biological descriptors in library design, and the resulting libraries will mimic common protein shapes. In high throughput screening, such libraries will be a valuable resource for the development of new lead compounds.

To this end, the present inventors have used a subset of the motifs and have screened the virtual library of molecules derived from the Cambridge Structural Database to identify molecules that match the spatial elements of the motifs. Our in-house program for virtual screening of virtual libraries, VECTRIX, was used to search the database. FIG. 39 a shows some of the scaffolds that match the β-turn conformations, FIG. 39 b shows some of the scaffolds that match the common loop conformations, and FIG. 39 c shows a scaffold that matches a common six-residue protein-protein interaction surface.

As illustrated in FIG. 39, molecules are identified that match the shape of the common motifs. This information will lead to the design of molecules that match common protein shapes.

Throughout this specification, the aim has been to describe the preferred embodiments of the invention without limiting the invention to any one embodiment or specific collection of features. Various changes and modifications may be made to the embodiments described and illustrated herein without departing from the broad spirit and scope of the invention.

All computer programs, algorithms, patent and scientific literature referred to in this specification are incorporated herein by reference in their entirety.

REFERENCES

-   1) Rose, G. D.; Young, W. B. Nature 1983, 304, 654-657.
-   2) Rose, G. D.; Gierasch, L. M.; Smith, J. A. Adv. Protein Chem. 1985, 37, 1-109.
-   3) Wilmot, C. M.; Thornton, J. M. Protein Engineering 1990, 3(6), 479-493.
-   4) Kabsch, W.; Sander, C. Biopolymers 1983, 22, 2577-2637.
-   5) Richardson, J. S. Adv. Protein Chem. 1981, 34, 167-339.
-   6) Wilmot, C. M.; Thornton, J. M. J. Mol. Biol. 1988, 203, 221-232.
-   7) Freidinger, R. M.; Veber, D. F.; Perlow, D. S.; Brooks, J. R. Science 1980, 210, 656-658.
-   8) Li, S. Z.; Lee, J. H.; Lee, W.; Yoon, C. J.; Baik, J. H.; Lim, S. K. Eur. J. Biochem. 1999, 265, 430-440.
-   9) Andrianov, A. M. Molecular Biology 1999, 33, 534-538.
-   10) Smith, J. A.; Pease, L. G. CRC Crit. Rev. Biochem. 1980, 8, 315-399.
-   11) Mutter, M. TIBS 1988, 13, 260-265.
-   12) Ball, J. B.; Alewood, P. F. Journal of Molecular Recognition 1990, 3, 55-56.
-   13) Tran, T. T.; Treutlein, H. R.; Burgess, A. W. J. Comput. Chem. 2001, 22, 1010-1025.
-   14) Douglas, A. J.; Mulholland, G.; Walker, B.; Guthrie, D. J. S.; Elmore, D. T.; Murphy, R. F. Biochem. Soc. Trans. 1988, 16, 175-176.
-   15) Li, W.; Burgess, K. Tetrahedron Lett. 1999, 40, 6527-6530.
-   16) Halab, L.; Lubell, W. D. Journal of Organic Chemistry 1999, 64, 3312-3321.
-   17) Terret, N. Drug Discovery Today 1999, 4, 141-141.
-   18) Rosenquist, S.; Souers, A. J.; Virgilio, A. A.; Schurer, S. S.; Ellman, J. A. Abstracts of Papers of the American Chemical Society 1999, 217, 212.
-   19) Gardner, R. R.; Liang, G. B.; Gellman, S. H. J. Am. Chem. Soc. 1999, 121, 1806-1816.
-   20) Mer, G.; Kellenberger, E.; Lefevre, J. F. J. Mol. Biol. 1998, 281, 235-240.
-   21) Lombardi, A.; D'Auria, G.; Maglio, O.; Nasti, F.; Quartara, L.; Pedone, C.; Pavone, V. J. Am. Chem. Soc. 1998, 120, 5879-5886.
-   22) Fink, B. E.; Kym, P. R.; Katzenellenbogen, J. A. J. Am. Chem. Soc. 1998, 120, 4334-4344.
-   23) Venkatachalam, C. M. Biopolymers 1968, 6, 1425-1436.
-   24) Lewis, P. N.; Momany, F. A.; Scheraga, H. A. Biochim. Biophys. Acta 1973, 303, 211-229.
-   25) Hutchinson, E. G.; Thornton, J. M. Protein Science 1994, 3, 2207-2216.
-   26) Ball, J. B.; Hughes, R. A.; Alewood, P. F.; Andrews, P. R. Tetrahedron 1993, 49, 3467-3478.
-   27) Ball, J. B.; Andrews, P. R.; Alewood, P. F.; Hughes, R. A. Federation of European Biochemical Societies 1990, 273, 15-16.
-   28) Garland, S. L.; Dean, P. M. J. Comput.-Aided Mol. Design 1999, 13, 469-483.
-   29) Garland, S. L.; Dean, P. M. J. Comput.-Aided Mol. Design 1999, 13, 485-498.
-   30) Ho, C. M.; Marshall, G. R. J. Comput.-Aided Mol. Design 1993, 185, 3-22.
-   31) Bernstein, F. C.; Koetzle, T. F.; Williams, G. J. B.; Meyer, E. F., Jr.; Brice, M. D.; Kennard, O.; Shimanouchi, T.; Tasumi, M. J. Mol. Biol. 1977, 112, 535-542.
-   32) Jakes, S. E.; Willett, P. Journal of Molecular Graphics 1986, 4, 12.
-   33) Wong, M. A.; Lane, T. J. R. Statist. Soc. B 1983, 45, 362-368.
-   34) Wong, A.; Lane, T. J. R. Statist. Soc. B 1983, 45, 362-368.
-   35) Tsai, C.-J.; Lin, S. L.; Wolfson, H. J.; Nussinov, R. Protein Science 1997, 6, 53-64.
-   36) Anonymous SAS/STAT User's guide, Volume 1, ANOVA-FREQ, Version 6; 1999.
-   37) Anderberg, M. R. Cluster analysis for applications; Academic Press: New York and London, 1973.
-   38) Forgy, E. W. Biometrics 1965, 21, 768.
-   39) MacQueen, J. B. Proc. Symp. Math. Statist. and Probability 1967, 1, 281-297.
-   40) Joseph, D.; Petsko, G. A.; Karplus, M. Science 1990, 249, 1425-1428.
-   41) Jones, S.; van Heyningen, P.; Berman, H. M.; Thornton, J. M. J. Mol. Biol. 1999, 287, 877-896.
-   42) Wu, S. J.; Dean, D. H. J. Mol. Biol. 1996, 255, 628-640.
-   43) Wlodawer, A.; Miller, M.; Jaskolski, M.; Sathyanarayana, B. K.; Baldwin, E.; Weber, I. T.; Selk, L. M.; Clawson, L.; Schneider, J.; Kent, S. B. H. Science 1989, 245, 616-621.
-   44) Lu, Y.; Valentine, J. S. Curr. Opin. Struct. Biol. 1997, 7, 495-500.
-   45) Bajorath, J.; Sheriff, S. Proteins 2001, 24, 152-157.
-   46) Kinoshita, K.; Sadanami, K.; Kidera, A.; Gõ, N. Protein Engineering 1999, 12, 11-14.
-   47) Perona, J. J.; Craik, C. S. Protein Science 1995, 4, 337-360.
-   48) Hobohm, U.; Scharf, M.; Schneider, R.; Sander, C. Protein Science 1992, 1, 409-417.
-   49) Hobohm, U.; Sander, C. Protein Science 1994, 3, 522.
-   50) Frishman, D.; Argos, P. Proteins: Structure, Function and Genetics 1995, 23, 566-579.
-   51) Damewood, J. R. In Lipkowitz, K. B., Boyd, D. B., Eds.; VCH Publishers: New York, 1996; pp 1-79.
-   52) Sokal, R. R.; Michener, C. D. University of Kansas Science Bulletin 1958, 38, 1409-1438.
-   53) Manly, B. F. J. Multivariate statistical method, A primer; Chapman & Hall: London, 1994.
-   54) Tsai, C.-J.; Lin, S. L.; Wolfson, H. J.; Nussinov, R. J. Mol. Biol. 1996, 260, 604-620.
-   55) Kaufman, L.; Rousseeuw, P. J., John Wiley and Sons Publisher: New York, 1990.

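-   56) Lauri & Bartlett J. Comp. Aid Mol. Des. 1994, 8 5

TABLE 1

| Cluster | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|
| 1 | 0.32 | 0.70 | 0.78 | 1.10 | 0.71 | 0.84 | 1.39 |
| 2 | 0.70 | 0.31 | 0.65 | 0.60 | 0.63 | 1.09 | 0.93 |
| 3 | 0.85 | 0.75 | 0.49 | 0.86 | 0.79 | 0.99 | 1.10 |
| 4 | 1.12 | 0.64 | 0.81 | 0.38 | 0.82 | 1.47 | 0.67 |
| 5 | 0.72 | 0.66 | 0.72 | 0.81 | 0.35 | 1.15 | 0.89 |
| 6 | 0.84 | 1.09 | 0.93 | 1.46 | 1.14 | 0.31 | 1.69 |
| 7 | 1.40 | 0.95 | 1.05 | 0.67 | 0.90 | 1.70 | 0.38 |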

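TABLE 2

| Cluster | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|
| 1 | 0.32 | 0.70 | 0.87 | 1.09 | 0.71 | 0.84 | 1.39 | 0.80 |
| 2 | 0.69 | 0.29 | 0.70 | 0.59 | 0.64 | 1.09 | 0.94 | 0.71 |
| 3 | 0.89 | 0.74 | 0.39 | 0.81 | 0.84 | 1.06 | 1.06 | 0.75 |
| 4 | 1.11 | 0.62 | 0.80 | 0.36 | 0.80 | 1.47 | 0.66 | 0.87 |
| 5 | 0.71 | 0.66 | 0.82 | 0.80 | 0.34 | 1.16 | 0.89 | 0.72 |
| 6 | 0.84 | 1.09 | 1.04 | 1.46 | 1.15 | 0.31 | 1.69 | 0.94 |
| 7 | 1.40 | 0.96 | 1.05 | 0.67 | 0.90 | 1.70 | 0.38 | 1.09 |
| 8 | 0.84 | 0.77 | 0.77 | 0.89 | 0.76 | 0.97 | 1.10 | 0.43 |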

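TABLE 3

| Cluster | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.32 | 0.70 | 0.86 | 1.09 | 0.71 | 0.84 | 1.39 | 0.79 | 2.14 |
| 2 | 0.69 | 0.29 | 0.70 | 0.59 | 0.64 | 1.09 | 0.93 | 0.71 | 1.67 |
| 3 | 0.88 | 0.74 | 0.39 | 0.81 | 0.84 | 1.06 | 1.06 | 0.76 | 1.61 |
| 4 | 1.10 | 0.62 | 0.80 | 0.36 | 0.81 | 1.47 | 0.67 | 0.87 | 1.24 |
| 5 | 0.71 | 0.66 | 0.82 | 0.80 | 0.34 | 1.16 | 0.88 | 0.72 | 1.76 |
| 6 | 0.84 | 1.09 | 1.04 | 1.46 | 1.15 | 0.30 | 1.69 | 0.93 | 2.37 |
| 7 | 1.40 | 0.96 | 1.05 | 0.67 | 0.89 | 1.70 | 0.38 | 1.09 | 1.09 |
| 8 | 0.83 | 0.77 | 0.77 | 0.90 | 0.76 | 0.96 | 1.10 | 0.43 | 1.77 |
| 9 | 2.15 | 1.69 | 1.61 | 1.24 | 1.77 | 2.38 | 1.10 | 1.77 | 0.43 |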

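TABLE 4

| 0.3¹ | 0.4² | RMSD³ | 0.3¹ | 0.5² | RMSD³ | 0.3¹ | 0.6² | RMSD³ | 0.3¹ | 0.7² | RMSD³ |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 4 | 1 | 0.11 | 4 | 1 | 0.15 | 2 | 1 | 0.39 | 9 | 1 | 0.51 |
| 2 | 2 | 0.42 | 2 | 2 | 0.34 | 4 | 2 | 0.26 | 4 | 2 | 0.22 |
| 5 | 3 | 0.15 | 5 | 3 | 0.17 | 16 | 3 | 0.21 | 16 | 3 | 0.21 |
| 3 | 4 | 0.30 | 12 | 4 | 0.64 | 12 | 4 | 0.64 | Aver |  | 0.31 |
| 6 | 5 | 0.15 | 4 | 5 | 0.27 | 16 | 5 | 0.00 |  |  |  |
| 12 | 6 | 0.64 | 3 | 6 | 0.30 | 37 | 6 | 0.57 |  |  |  |
| 11 | 7 | 0.29 | 13 | 7 | 0.21 | Aver |  | 0.34 |  |  |  |
| 4 | 8 | 0.12 | 14 | 8 | 0.22 |  |  |  |  |  |  |
| 10 | 9 | 0.00 | 10 | 9 | 0.14 |  |  |  |  |  |  |
| 14 | 10 | 0.26 | 1 | 10 | 0.12 |  |  |  |  |  |  |
| 13 | 11 | 0.36 | 31 | 11 | 0.21 |  |  |  |  |  |  |
| 17 | 12 | 0.21 | 37 | 12 | 0.60 |  |  |  |  |  |  |
| 27 | 13 | 0.06 | 37 | 13 | 0.20 |  |  |  |  |  |  |
| 31 | 14 | 0.21 | 37 | 14 | 0.72 |  |  |  |  |  |  |
| 29 | 15 | 0.15 | 35 | 15 | 0.88 |  |  |  |  |  |  |
| 37 | 16 | 0.42 | Aver |  | 0.34 |  |  |  |  |  |  |
| 37 | 17 | 0.45 |  |  |  |  |  |  |  |  |  |
| 19 | 18 | 0.71 |  |  |  |  |  |  |  |  |  |
| 19 | 19 | 0.80 |  |  |  |  |  |  |  |  |  |
| 2 | 20 | 0.80 |  |  |  |  |  |  |  |  |  |
| 16 | 21 | 0.78 |  |  |  |  |  |  |  |  |  |
| Aver |  | 0.35 |  |  |  |  |  |  |  |  |  |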

TABLE 5 1 2 3 4 5 6 7 8 9 10 11 12 13  1 0.39 0.75 0.81 0.79 1.09 1.251.38 0.88 0.85 1.06 1.25 1.40 1.85  2 0.49 0.80 0.82 0.95 0.90 1.18 0.841.02 1.00 0.93 0.97 1.45  3 0.42 0.75 0.88 1.02 0.83 1.26 1.32 1.23 1.361.18 1.61  4 0.41 0.72 0.89 0.95 1.18 1.14 0.90 1.32 1.28 1.58  5 0.440.74 0.87 1.22 1.28 0.93 1.14 1.03 1.17  6 0.47 1.01 1.22 1.33 0.88 1.070.95 0.96  7 0.52 1.70 1.76 1.46 1.57 1.16 1.44  8 0.37 0.76 0.94 0.901.34 1.62  9 0.49 0.84 1.05 1.48 1.72 10 0.50 1.08 1.34 1.34 11 0.450.82 1.11 12 0.50 0.95 13 0.53 14 15 16 17 18 19 20 21 22 23 24 25 26 2728 29 30 31 32 33 34 35 36 37 38 39 14 15 16 17 18 19 20 21 22 23 24 2526  1 2.08 2.67 2.80 3.07 3.19 3.10 2.69 3.37 3.02 2.93 3.22 3.05 3.07 2 1.85 2.31 2.48 2.73 2.91 2.77 2.34 3.06 2.79 2.71 2.97 3.09 3.01  31.48 2.21 2.32 2.73 2.79 2.70 2.27 3.07 2.58 2.59 2.87 3.22 3.03  4 1.762.39 2.48 2.85 2.94 2.80 2.40 3.16 2.76 2.88 2.98 3.14 3.00  5 1.54 2.022.11 2.45 2.58 2.39 1.99 2.74 2.41 2.34 2.62 2.99 2.77  6 1.63 1.93 2.092.36 2.53 2.27 1.83 2.61 2.39 2.27 2.57 2.92 2.70  7 1.05 1.81 1.89 2.382.40 2.33 1.95 2.78 2.21 2.28 2.49 3.01 2.85  8 2.33 2.63 2.74 2.88 3.062.94 2.55 3.06 2.98 2.78 3.01 2.64 2.75  9 2.43 2.66 2.80 2.86 2.99 2.942.57 3.02 2.92 2.72 2.92 2.47 2.57 10 2.16 2.39 2.50 2.71 2.83 2.63 2.222.83 2.69 2.45 2.76 2.58 2.51 11 2.08 2.11 2.32 2.34 2.64 2.46 2.11 2.652.67 2.54 2.73 2.75 2.77 12 1.52 1.65 1.90 2.03 2.28 2.14 1.79 2.48 2.292.29 2.41 2.88 2.77 13 1.69 1.45 1.68 1.74 2.03 1.68 1.30 1.97 2.02 1.882.09 2.56 2.35 14 0.56 1.32 1.31 1.90 1.78 1.88 1.62 2.35 1.54 1.80 1.892.43 2.36 15 0.43 0.81 0.79 0.88 0.89 0.88 1.34 1.10 1.20 1.17 1.67 1.6716 0.48 1.03 0.92 1.02 1.05 1.41 0.88 0.95 0.83 1.43 1.39 17 0.43 0.710.73 1.05 0.92 1.19 1.23 0.97 1.26 1.39 18 0.44 0.80 1.17 1.09 0.87 1.060.77 1.03 1.20 19 0.44 0.76 0.78 1.04 0.95 0.95 1.33 1.22 20 0.49 1.091.17 0.94 1.26 1.74 1.48 21 0.49 1.32 1.09 1.10 1.21 1.04 22 0.41 0.840.66 1.23 1.10 23 0.51 0.88 1.25 0.88 24 0.46 0.83 0.88 25 0.48 0.81 260.47 
27 28 29 30 31 32 33 34 35 36 37 38 39 27 28 29 30 31 32 33 34 3536 37 38 39  1 3.29 1.99 1.66 1.66 2.20 1.59 1.65 2.51 2.65 2.50 2.833.24 3.18  2 3.14 1.84 1.65 1.39 2.14 1.45 1.74 2.57 2.65 2.43 2.78 3.103.12  3 3.12 2.23 1.95 1.76 2.54 1.92 2.09 2.63 2.88 2.78 3.09 3.20 3.38 4 3.18 1.97 1.62 1.54 2.39 1.77 1.89 2.70 2.68 2.53 2.98 3.23 3.21  52.87 1.81 1.60 1.30 2.24 1.65 1.89 2.68 2.59 2.39 2.74 2.88 3.02  6 2.761.67 1.52 1.11 2.20 1.59 1.86 2.58 2.49 2.28 2.73 2.84 2.93  7 2.83 2.382.15 1.84 2.77 2.20 2.43 3.14 2.99 2.93 3.10 2.91 3.27  8 2.95 1.45 1.321.22 1.63 1.03 1.25 2.07 2.26 1.96 2.31 2.79 2.58  9 2.81 1.40 1.08 1.261.57 1.03 1.01 1.87 2.03 1.86 2.20 2.69 2.55 10 2.75 1.27 0.95 0.95 1.771.22 1.30 2.08 2.02 1.83 2.38 2.83 2.67 11 2.87 1.33 1.39 0.90 1.55 0.911.39 2.18 2.32 1.92 2.16 2.65 2.59 12 2.69 1.84 1.85 1.28 2.11 1.52 1.952.67 2.74 2.42 2.51 2.52 2.87 13 2.28 1.55 1.66 0.99 2.01 1.56 1.93 2.512.41 2.11 2.32 2.23 2.49 14 2.29 2.84 2.70 2.33 3.00 2.67 2.93 2.96 2.863.18 2.84 2.42 2.84 15 1.42 2.57 2.61 2.16 2.42 2.50 2.71 2.24 2.22 2.501.93 1.41 1.90 16 1.35 2.65 2.61 2.32 2.49 2.66 2.75 2.08 2.00 2.45 2.001.50 1.90 17 1.07 2.41 2.62 2.30 1.95 2.40 2.41 1.76 1.85 1.99 1.35 0.621.35 18 0.87 2.53 2.58 2.60 2.02 2.61 2.41 1.61 1.66 1.98 1.45 0.92 1.3819 0.90 2.50 2.58 2.29 2.25 2.65 2.64 1.89 1.78 2.08 1.62 1.07 1.36 201.29 2.30 2.33 1.94 2.54 2.44 2.52 2.30 2.01 2.37 2.07 1.55 1.77 21 0.812.24 2.41 2.40 1.98 2.55 2.38 1.63 1.45 1.68 1.34 0.92 0.93 22 1.05 2.662.62 2.61 2.37 2.83 2.58 1.81 1.65 2.15 1.89 1.44 1.66 23 0.98 2.39 2.242.33 2.45 2.67 2.46 1.81 1.46 2.03 1.94 1.48 1.53 24 0.90 2.44 2.39 2.592.08 2.64 2.30 1.48 1.44 1.90 1.55 1.11 1.39 25 0.85 1.92 1.91 2.45 1.652.13 1.72 0.87 0.98 1.37 1.13 0.95 1.05 26 0.78 1.99 1.91 2.40 1.90 2.381.98 1.21 0.88 1.48 1.49 1.25 1.08 27 0.49 2.11 2.17 2.55 1.83 2.44 2.131.30 1.14 1.50 1.27 0.69 0.90 28 0.43 0.80 0.80 0.95 0.67 0.99 1.38 1.370.83 1.44 1.97 1.60 29 0.50 0.99 1.26 1.05 0.85 1.37 1.29 1.13 1.74 
2.221.85 30 0.47 1.38 0.93 1.27 1.90 1.88 1.44 1.93 2.39 2.18 31 0.46 0.870.89 1.01 1.39 0.89 0.89 1.43 1.38 32 0.46 0.83 1.50 1.76 1.30 1.55 2.091.96 33 0.51 1.08 1.38 1.16 1.43 1.93 1.78 34 0.52 0.85 0.97 0.95 1.221.11 35 0.51 0.94 1.28 1.42 1.02 36 0.47 0.99 1.42 0.88 37 0.49 0.780.86 38 0.46 0.86 39 0.52

TABLE 6

| Method | Family TOL (Å) | TOL (Å) | Sealevel (% peak) | $\sum_{i=1}^{30} C_i$ | Unique $\sum_{i=1}^{30} C_i$ | Avg. Intracluster RMSD | Avg. Intercluster RMSD |
|---|---|---|---|---|---|---|---|
| O | 0.25 | 0.25 | 0 | 744 | 699 | 0.16 | 2.10 |
| O | 0.5 | 0.5 | 0 | 4862 | 4862 | 0.21 | 2.26 |
| O | 0.75 | 0.75 | 0 | 8902 | 8899 | 0.27 | 2.37 |
| G | 0.25 | 0.2 | 0.125 | 2427 | 1748 | 0.16 | 2.10 |
| G | 0.25 | 0.2 | 0.25 | 1905 | 1407 | 0.16 | 2.10 |
| G | 0.25 | 0.2 | 0.5 | 812 | 626 | 0.14 | 2.10 |
| G | 0.25 | 0.2 | 0.75 | 223 | 166 | 0.23 | 2.11 |
| G | 0.25 | 0.3 | 0.125 | 3801 | 2687 | 0.18 | 2.11 |
| G | 0.25 | 0.3 | 0.25 | 2498 | 1825 | 0.17 | 2.11 |
| G | 0.25 | 0.3 | 0.5 | 954 | 709 | 0.16 | 2.10 |
| G | 0.25 | 0.3 | 0.75 | 281 | 217 | 0.22 | 2.10 |
| G | 0.25 | 0.5 | 0.125 | 3802 | 2688 | 0.18 | 2.11 |
| G | 0.25 | 0.5 | 0.25 | 2498 | 1825 | 0.17 | 2.11 |
| G | 0.25 | 0.5 | 0.5 | 956 | 711 | 0.16 | 2.10 |
| G | 0.25 | 0.5 | 0.75 | 295 | 225 | 0.21 | 2.10 |
| G | 0.5 | 0.2 | 0.125 | 2079 | 2079 | 0.19 | 2.26 |
| G | 0.5 | 0.2 | 0.25 | 2073 | 2073 | 0.19 | 2.26 |
| G | 0.5 | 0.2 | 0.5 | 1853 | 1853 | 0.18 | 2.26 |
| G | 0.5 | 0.2 | 0.75 | 1047 | 1047 | 0.17 | 2.26 |
| G | 0.5 | 0.3 | 0.125 | 7654 | 7654 | 0.23 | 2.26 |
| G | 0.5 | 0.3 | 0.25 | 6715 | 6715 | 0.22 | 2.26 |
| G | 0.5 | 0.3 | 0.5 | 4071 | 4071 | 0.19 | 2.26 |
| G | 0.5 | 0.3 | 0.75 | 1540 | 1540 | 0.16 | 2.26 |
| G | 0.5 | 0.5 | 0.125 | 10135 | 8991 | 0.29 | 2.26 |
| G | 0.5 | 0.5 | 0.25 | 7413 | 7013 | 0.26 | 2.26 |
| G | 0.5 | 0.5 | 0.5 | 4088 | 4088 | 0.19 | 2.26 |
| G | 0.5 | 0.5 | 0.75 | 1540 | 1540 | 0.16 | 2.26 |
| G | 0.75 | 0.2 | 0.125 | 1848 | 1848 | 0.21 | 2.37 |
| G | 0.75 | 0.2 | 0.25 | 1848 | 1848 | 0.21 | 2.37 |
| G | 0.75 | 0.2 | 0.5 | 1846 | 1846 | 0.21 | 2.37 |
| G | 0.75 | 0.2 | 0.75 | 1646 | 1646 | 0.21 | 2.37 |
| G | 0.75 | 0.3 | 0.125 | 8040 | 8040 | 0.25 | 2.37 |
| G | 0.75 | 0.3 | 0.25 | 7973 | 7973 | 0.25 | 2.37 |
| G | 0.75 | 0.3 | 0.5 | 7149 | 7149 | 0.24 | 2.37 |
| G | 0.75 | 0.3 | 0.75 | 4215 | 4215 | 0.21 | 2.37 |
| G | 0.75 | 0.5 | 0.125 | 33367 | 14969 | 0.67 | 2.32 |
| G | 0.75 | 0.5 | 0.25 | 19813 | 11479 | 0.51 | 2.34 |
| G | 0.75 | 0.5 | 0.5 | 9975 | 8506 | 0.31 | 2.37 |
| G | 0.75 | 0.5 | 0.75 | 4726 | 4598 | 0.23 | 2.38 |

TABLE 7 Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Q9 Q10 Q11 Q12 Q13 Q14 Q15 Q16 Size ofCluster 616 682 278 258 609 1809 256 261 300 274 266 423 303 1750 310304 Q17 Q18 Q19 Q20 Q21 Q22 Q23 Q24 Q25 Q26 Q27 Q28 Q29 Q30 Size ofCluster 476 1263 345 340 944 1667 1648 386 882 428 855 473 545 861 Q1 Q2Q3 Q4 Q5 Q6 Q7 Q8 Q9 Q10 Q1 0.69 2.56 3.72 3.09 0.69 2.32 3.14 3.97 1.182.74 Q2 2.38 0.01 1.83 1.78 2.58 2.54 1.79 2.23 2.78 1.99 Q3 3.77 1.910.27 1.58 3.77 3.98 1.58 1.29 4.01 2.47 Q4 3.06 1.71 1.59 0.24 3.06 3.161.05 1.47 3.42 1.78 Q5 0.70 2.56 3.71 3.06 0.71 2.33 3.11 3.94 1.15 2.72Q6 2.31 2.45 3.97 3.24 2.31 1.13 3.23 4.11 2.31 2.80 Q7 3.08 1.69 1.581.04 3.08 3.12 0.27 1.47 3.30 1.75 Q8 3.97 2.26 1.30 1.51 3.97 4.11 1.490.22 4.23 2.33 Q9 1.30 2.81 4.00 3.43 1.30 2.37 3.30 4.20 0.32 2.77 Q102.69 1.88 2.45 1.77 2.69 2.64 1.73 2.20 2.77 0.25 Q11 3.99 2.26 1.231.56 3.99 4.10 1.49 0.96 4.20 2.33 Q12 2.40 1.75 2.33 1.74 2.45 2.611.65 2.20 2.80 1.07 Q13 2.88 2.36 2.79 2.10 2.88 2.12 2.29 2.97 3.042.11 Q14 2.38 2.58 4.07 3.06 2.38 1.10 2.98 4.14 2.43 2.77 Q15 3.04 2.342.71 2.33 3.04 2.11 2.10 3.01 2.89 1.89 Q16 3.26 2.56 3.25 2.80 3.262.52 2.76 3.57 3.41 2.19 Q17 2.71 2.94 4.13 3.17 2.71 1.49 3.30 4.152.41 2.64 Q18 2.50 1.14 1.80 1.39 2.50 2.83 1.29 2.12 2.76 1.97 Q19 1.831.50 2.69 1.96 1.83 2.12 1.74 2.75 2.19 1.71 Q20 2.03 1.52 2.66 1.722.03 2.18 2.01 2.78 1.91 1.59 Q21 2.42 1.14 1.81 1.46 2.42 2.78 1.382.20 2.76 1.92 Q22 2.10 2.47 3.98 3.07 2.19 1.09 3.01 4.23 2.34 2.78 Q232.30 2.47 3.97 3.08 2.30 1.12 3.05 4.24 2.27 2.89 Q24 2.83 1.75 2.581.85 2.83 2.52 1.95 2.34 2.78 1.06 Q25 2.44 1.71 2.65 2.23 2.44 2.262.21 2.62 2.57 1.58 Q26 3.88 2.17 1.09 1.42 3.88 3.94 1.48 1.26 4.162.11 Q27 2.37 1.67 2.62 2.28 2.37 2.23 2.20 2.66 2.52 1.01 Q28 2.14 1.002.80 2.42 2.14 2.09 2.30 2.88 2.13 1.47 Q29 3.01 1.14 1.29 1.55 3.013.02 1.01 1.78 3.29 2.16 Q30 3.42 1.12 1.80 1.50 2.42 2.77 1.37 2.212.72 1.95 Q11 Q12 Q13 Q14 Q15 Q16 Q17 Q18 Q19 Q20 Q1 4.00 2.53 2.90 2.323.07 3.22 2.82 2.45 1.81 1.98 Q2 2.23 1.78 
2.34 2.54 2.33 2.53 2.98 1.111.45 1.47 Q3 1.22 2.36 2.79 3.98 2.71 3.26 4.10 1.86 2.72 2.68 Q4 1.561.84 2.11 3.16 2.33 2.81 3.21 1.49 1.97 1.73 Q5 3.96 2.51 2.90 2.33 3.073.20 2.80 2.44 1.77 1.95 Q6 4.14 2.74 2.28 1.11 2.19 2.53 1.52 2.76 2.122.14 Q7 1.47 1.72 2.29 3.12 2.10 2.76 3.34 1.41 1.75 2.01 Q8 0.96 2.362.98 4.12 3.02 3.61 4.19 2.19 2.76 2.84 Q9 4.14 2.82 3.03 2.37 2.89 3.392.45 2.77 2.18 1.90 Q10 2.28 1.18 2.13 2.04 1.89 2.20 2.66 1.98 1.721.59 Q11 0.23 2.34 3.05 4.11 2.96 3.61 4.32 2.18 2.85 2.82 Q12 2.27 0.301.89 2.61 2.11 2.17 2.76 1.83 1.36 1.60 Q13 3.00 1.86 0.28 2.12 1.072.26 1.85 3.64 2.22 2.24 Q14 4.11 2.67 1.92 1.11 1.89 2.01 1.30 2.851.96 2.09 Q15 2.95 2.02 1.07 2.11 0.28 2.23 1.89 2.63 2.28 2.22 Q16 3.582.22 2.28 2.52 2.24 0.29 2.72 2.88 2.56 2.59 Q17 4.25 2.79 1.80 1.481.84 2.75 0.54 3.19 2.40 2.17 Q18 3.12 1.74 2.71 2.83 2.70 2.87 3.300.87 1.43 1.44 Q19 2.82 1.43 2.22 2.12 2.25 2.53 2.43 1.47 0.30 1.04 Q202.81 1.06 2.22 2.18 2.21 2.58 2.22 1.46 1.05 0.28 Q21 2.17 1.85 2.642.78 2.72 2.89 3.36 0.69 1.37 1.58 Q22 4.20 2.01 2.21 1.08 2.21 2.321.71 2.72 1.90 2.12 Q23 4.24 2.71 2.25 1.12 2.25 2.30 1.67 2.75 2.102.03 Q24 3.41 1.61 1.94 2.53 1.77 2.39 2.49 2.06 1.79 1.56 Q25 2.62 1.341.76 2.26 1.98 2.67 2.38 2.03 1.47 1.68 Q26 1.21 2.16 2.90 3.04 3.022.07 4.20 1.93 2.80 2.73 Q27 3.04 1.38 1.78 2.23 1.99 2.50 2.38 1.081.39 1.62 Q28 3.02 1.61 2.00 2.09 1.90 2.28 2.25 1.96 1.21 1.35 Q29 1.751.94 2.52 3.02 2.51 2.57 3.53 1.32 1.97 2.08 Q30 2.17 1.85 2.02 2.772.69 2.89 3.34 0.68 1.40 1.57 Q21 Q22 Q23 Q24 Q25 Q26 Q27 Q28 Q29 Q30 Q12.41 2.31 2.31 2.84 2.37 2.86 2.37 2.12 2.96 2.40 Q2 1.07 2.54 2.54 1.752.21 2.21 1.73 1.31 1.01 1.06 Q3 1.86 3.98 3.99 2.59 2.67 1.08 2.67 2.821.29 1.85 Q4 1.55 3.16 3.16 1.85 2.21 1.40 2.21 2.41 1.55 1.55 Q5 2.392.32 2.32 2.81 2.33 3.84 2.33 2.08 2.96 2.39 Q6 2.74 1.11 1.11 2.44 2.284.08 2.28 1.97 2.90 2.73 Q7 1.44 3.12 3.12 1.99 2.16 1.45 2.10 2.30 1.611.44 Q8 2.25 4.12 4.12 2.39 2.73 1.28 2.78 2.89 1.79 2.25 Q9 2.73 2.362.86 
2.75 2.53 4.14 2.53 2.12 3.27 2.73 Q10 1.93 2.64 2.65 1.06 1.602.09 1.59 1.47 2.13 1.99 Q11 2.23 4.11 4.11 2.46 4.66 1.21 2.66 3.071.76 2.23 Q12 1.88 2.61 2.62 1.42 1.23 2.09 1.22 1.53 2.07 1.89 Q13 2.632.13 3.13 1.93 1.83 2.97 1.82 2.00 2.50 2.63 Q14 2.83 1.09 1.09 2.562.22 4.00 2.23 2.07 3.14 2.83 Q15 2.61 2.11 2.11 1.74 2.01 2.99 2.011.89 2.50 2.61 Q16 2.86 2.51 2.51 2.39 2.36 3.08 2.36 2.28 2.57 2.88 Q173.17 1.48 1.49 2.45 2.40 4.18 2.40 2.20 3.51 3.17 Q18 0.85 2.83 2.832.10 2.01 1.91 2.00 1.87 1.19 0.86 Q19 1.46 2.11 2.11 1.76 1.47 2.751.47 1.20 1.96 1.46 Q20 1.43 2.18 2.18 1.56 1.62 2.70 1.62 1.35 2.051.42 Q21 0.60 2.78 2.78 2.16 2.01 1.89 2.01 1.99 1.25 0.56 Q22 2.68 1.071.07 2.72 2.24 3.92 2.25 2.09 2.93 2.07 Q23 3.71 1.11 1.11 2.66 2.343.93 2.34 1.97 2.89 2.71 Q24 2.07 2.53 2.53 0.26 1.36 2.27 1.36 1.251.99 2.08 Q25 2.02 2.27 2.27 1.20 0.71 2.72 6.68 1.17 1.94 2.02 Q26 1.953.94 3.94 2.29 2.69 0.22 2.69 2.98 1.71 1.95 Q27 1.08 2.24 2.24 1.270.68 2.79 0.70 1.12 1.89 1.98 Q28 1.98 2.09 2.09 1.24 1.28 2.96 1.270.28 1.91 1.98 Q29 1.84 3.02 3.02 1.99 1.99 1.70 1.99 1.91 0.28 1.38 Q300.53 2.77 2.77 2.19 2.03 1.88 2.03 2.01 1.25 0.52

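TABLE 8

| Method | Family TOL (Å) | TOL (Å) | Sealevel (% peak) | $\sum_{i=1}^{30} C_i$ | Unique $\sum_{i=1}^{30} C_i$ | Avg. Intracluster RMSD | Avg. Intercluster RMSD |
|---|---|---|---|---|---|---|---|
| O | 0.25 | 0.25 | 0 | 163 | 160 | 0.31 | 2.46 |
| O | 0.5 | 0.5 | 0 | 2033 | 2033 | 0.23 | 2.58 |
| O | 0.75 | 0.75 | 0 | 4698 | 4698 | 0.23 | 2.59 |
| G | 0.25 | 0.2 | 0.125 | 88 | 88 | 0.35 | 2.48 |
| G | 0.25 | 0.2 | 0.25 | 87 | 87 | 0.35 | 2.48 |
| G | 0.25 | 0.2 | 0.5 | 78 | 78 | 0.35 | 2.48 |
| G | 0.25 | 0.2 | 0.75 | 57 | 57 | 0.20 | 2.48 |
| G | 0.25 | 0.3 | 0.125 | 475 | 322 | 0.23 | 2.45 |
| G | 0.25 | 0.3 | 0.25 | 438 | 307 | 0.23 | 2.45 |
| G | 0.25 | 0.3 | 0.5 | 289 | 223 | 0.25 | 2.46 |
| G | 0.25 | 0.3 | 0.75 | 131 | 114 | 0.28 | 2.46 |
| G | 0.25 | 0.5 | 0.125 | 498 | 342 | 0.23 | 2.45 |
| G | 0.25 | 0.5 | 0.25 | 453 | 319 | 0.23 | 2.45 |
| G | 0.25 | 0.5 | 0.5 | 309 | 237 | 0.24 | 2.45 |
| G | 0.25 | 0.5 | 0.75 | 162 | 135 | 0.28 | 2.46 |
| G | 0.5 | 0.2 | 0.125 | 69 | 69 | 0.23 | 2.59 |
| G | 0.5 | 0.2 | 0.25 | 69 | 69 | 0.23 | 2.59 |
| G | 0.5 | 0.2 | 0.5 | 68 | 68 | 0.23 | 2.59 |
| G | 0.5 | 0.2 | 0.75 | 57 | 57 | 0.22 | 2.59 |
| G | 0.5 | 0.3 | 0.125 | 1970 | 1970 | 0.22 | 2.58 |
| G | 0.5 | 0.3 | 0.25 | 1846 | 1846 | 0.22 | 2.58 |
| G | 0.5 | 0.3 | 0.5 | 1252 | 1252 | 0.21 | 2.58 |
| G | 0.5 | 0.3 | 0.75 | 458 | 458 | 0.27 | 2.58 |
| G | 0.5 | 0.5 | 0.125 | 4258 | 4258 | 0.27 | 2.58 |
| G | 0.5 | 0.5 | 0.25 | 3016 | 3016 | 0.24 | 2.58 |
| G | 0.5 | 0.5 | 0.5 | 1506 | 1506 | 0.21 | 2.58 |
| G | 0.5 | 0.5 | 0.75 | 525 | 525 | 0.22 | 2.58 |
| G | 0.75 | 0.2 | 0.125 | 45 | 45 | 0.15 | 2.58 |
| G | 0.75 | 0.2 | 0.25 | 45 | 45 | 0.15 | 2.58 |
| G | 0.75 | 0.2 | 0.5 | 45 | 45 | 0.15 | 2.58 |
| G | 0.75 | 0.2 | 0.75 | 45 | 45 | 0.15 | 2.58 |
| G | 0.75 | 0.3 | 0.125 | 1864 | 1861 | 0.24 | 2.59 |
| G | 0.75 | 0.3 | 0.25 | 1863 | 1863 | 0.24 | 2.59 |
| G | 0.75 | 0.3 | 0.5 | 1806 | 1806 | 0.24 | 2.59 |
| G | 0.75 | 0.3 | 0.75 | 1205 | 1205 | 0.23 | 2.58 |
| G | 0.75 | 0.5 | 0.125 | 6092 | 5427 | 0.38 | 2.57 |
| G | 0.75 | 0.5 | 0.35 | 3660 | 3660 | 0.34 | 2.58 |
| G | 0.75 | 0.5 | 0.5 | 1607 | 1607 | 0.27 | 2.59 |
| G | 0.75 | 0.3 | 0.76 | 1349 | 1349 | 0.23 | 2.59 |
| G | 0.75 | 0.7 | 0.125 | 10467 | 7852 | 0.56 | 2.55 |
| G | 0.75 | 0.7 | 0.25 | 7799 | 6026 | 0.46 | 2.55 |
| G | 0.75 | 0.7 | 0.5 | 3690 | 3690 | 0.27 | 2.59 |
| G | 0.75 | 0.7 | 0.75 | 1607 | 1607 | 0.23 | 2.59 |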

TABLE 9 Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Q9 Q10 Q11 Q12 Q13 Q14 Q15 Q16 Size ofCluster 298 286 821 309 245 174 147 168 178 290 191 587 305 281 185 182Q17 Q18 Q19 Q20 Q21 Q22 Q23 Q24 Q25 Q26 Q27 Q28 Q29 Q30 Size of Cluster270 185 519 196 308 772 716 232 702 289 650 491 284 298 Q1 Q2 Q3 Q4 Q5Q6 Q7 Q8 Q9 Q10 Q1 0.60 2.02 2.04 1.47 2.67 1.96 3.14 2.08 2.18 2.71 Q22.00 0.66 2.46 2.44 3.26 2.30 4.07 2.48 1.15 2.48 Q3 1.94 2.43 0.90 1.752.41 2.43 3.03 2.23 2.25 2.65 Q4 1.44 2.42 1.83 0.78 2.70 2.17 3.03 1.982.32 2.67 Q5 2.70 3.37 2.44 2.80 0.57 3.80 4.29 3.71 3.42 2.82 Q6 2.032.25 2.47 2.25 3.59 0.30 2.85 1.00 2.52 3.60 Q7 3.20 4.06 3.01 4.11 4.322.86 0.26 2.84 4.15 4.93 Q8 2.15 2.47 2.28 2.07 3.70 1.08 2.83 0.30 2.363.64 Q9 2.27 1.27 2.31 2.39 3.36 2.53 4.16 2.36 0.34 2.44 Q10 2.05 2.462.67 2.75 2.80 3.72 4.93 3.72 2.40 0.71 Q11 2.78 3.29 2.25 2.05 3.072.15 1.85 2.35 3.17 3.95 Q12 2.69 2.60 2.09 2.65 2.69 3.74 4.86 3.732.45 1.26 Q13 2.16 2.39 2.31 2.22 2.62 2.67 3.56 2.69 2.29 2.54 Q14 2.152.32 2.12 2.04 2.71 2.66 3.57 2.07 2.42 2.45 Q15 2.78 3.16 2.08 2.503.29 3.38 1.83 2.12 3.32 3.93 Q16 3.25 4.02 3.00 3.28 4.44 2.48 1.322.64 4.03 4.91 Q17 2.13 2.18 2.24 1.91 3.03 2.54 3.94 2.43 2.10 2.28 Q183.28 3.94 2.91 3.23 4.50 2.66 1.29 2.44 4.07 4.91 Q19 2.19 2.80 2.052.52 3.87 2.25 2.25 2.06 2.65 3.52 Q20 2.55 2.88 2.55 2.72 4.12 1.971.39 2.12 3.15 3.99 Q21 2.17 2.09 2.40 2.14 2.95 2.40 3.93 2.49 2.212.08 Q22 1.98 2.61 0.95 1.81 2.32 2.48 2.91 2.28 2.37 2.61 Q23 1.84 2.421.36 2.01 2.10 2.30 2.90 2.46 2.54 2.75 Q24 2.74 2.92 2.52 2.85 4.092.36 1.91 2.45 3.10 4.12 Q25 1.80 2.34 1.38 2.00 2.10 2.10 2.87 2.332.45 2.74 Q26 2.87 3.07 2.50 2.88 3.95 2.46 1.91 2.39 2.91 4.10 Q27 2.032.54 0.94 1.83 2.28 2.45 2.94 2.24 1.28 2.56 Q28 2.49 2.80 2.05 2.523.87 2.25 2.25 2.00 2.65 3.52 Q29 2.35 2.63 2.53 2.37 3.83 1.81 2.441.81 1.70 3.47 Q30 2.24 2.53 2.11 2.56 3.02 2.09 2.43 2.32

.07 3.44 Q11 Q12 Q13 Q14 Q15 Q16 Q17 Q18 Q19 Q20 Q1 2.72 2.59 2.24 2.242.80 3.19 2.25 3.22 2.45 2.49 Q2 3.34 2.68 2.52 2.37 3.21 4.01 2.14 3.962.79 2.83 Q3 2.10 2.74 2.35 2.15 2.03 2.99 2.18 2.86 1.94 2.44 Q4 2.582.63 2.16 1.98 2.42 3.25 1.97 3.16 2.48 2.67 Q5 3.13 2.67 2.63 2.71 3.204.41 3.10 4.48 3.73 4.09 Q6 2.15 3.54 2.56 2.60 2.38 2.47 2.56 2.66 2.251.97 Q7 1.80 4.74 3.01 3.62 1.84 1.33 3.96 1.29 2.24 1.89 Q8 2.36 3.542.62 2.57 2.13 2.64 2.45 2.44 2.03 2.11 Q9 3.17 2.57 2.36 2.47 3.32 4.052.15 4.09 2.71 3.15 Q10 3.95 1.44 2.01 2.44 3.01 4.90 2.28 4.91 3.513.99 Q11 0.30 3.93 2.81 2.93 1.07 1.79 3.24 2.01 2.18 2.32 Q12 3.97 1.102.47 2.42 3.93 4.91 2.04 4.85 3.60 3.87 Q13 2.84 2.28 0.64 1.20 2.963.74 1.82 3.66 2.07 2.73 Q14 2.91 2.36 1.21 0.61 2.82 3.66 1.59 3.742.67 2.90 Q15 1.08 3.96 2.96 2.80 0.28 2.04 3.11 1.77 2.37 2.17 Q16 1.794.83 3.70 3.70 2.06 0.27 3.98 1.06 2.14 2.01 Q17 3.24 2.25 1.77 1.533.10 3.97 0.58 4.00 2.69 2.68 Q18 2.02 4.82 3.71 3.79 1.78 1.07 4.010.26 2.06 1.88 Q19 2.29 3.63 2.70 2.69 2.45 2.24 2.68 2.16 0.70 1.28 Q202.31 3.96 2.79 2.94 2.18 2.00 2.72 1.87 1.41 0.28 Q21 3.11 2.33 1.571.75 3.18 4.02 1.17 3.95 2.51 2.85 Q22 2.21 2.73 2.36 2.16 2.06 2.932.38 2.76 2.01 2.58 Q23 2.02 2.63 2.16 2.26 2.21 2.79 2.50 2.89 2.182.44 Q24 1.71 4.08 2.95 2.76 1.35 1.89 3.14 1.98 2.05 2.07 Q25 1.94 2.642.14 2.24 2.14 2.79 2.39 2.90 2.22 2.53 Q26 1.36 4.09 2.77 2.90 1.621.99 3.20 1.94 1.96 2.02 Q27 2.24 2.69 2.32 2.11 2.04 2.90 2.30 2.822.04 2.62 Q28 2.29 3.64 2.70 2.09 2.45 2.24 2.67 2.15 0.70 1.27 Q29 2.333.25 2.04 2.01 2.32 2.54 2.29 2.52 2.05 1.88 Q30 2.32 3.39 2.60 2.483.20 2.07 2.42 2.32 1.81 1.32 Q21 Q22 Q23 Q24 Q25 Q26 Q27 Q28 Q29 Q30 Q12.23 2.03 1.83 2.79 1.83 2.92 2.01 2.45 2.26 2.17 Q2 2.08 2.46 2.28 2.862.27 3.03 2.43 2.80 2.56 2.45 Q3 2.36 0.98 1.37 2.38 1.30 2.53 0.95 1.952.03 1.90 Q4 2.20 1.83 2.02 2.87 2.02 2.90 1.80 2.48 2.26 2.69 Q5 3.032.43 2.21 4.09 2.21 3.94 2.37 3.73 3.78 3.59 Q6 2.44 2.47 2.27 2.35 2.272.45 2.51 2.25 1.77 2.08 Q7 3.95 
3.01 2.99 1.92 2.99 1.90 3.02 2.24 2.452.45 Q8 2.53 2.29 2.45 2.44 2.44 2.38 2.33 2.03 1.76 2.29 Q9 2.22 2.312.48 3.09 2.48 2.90 2.29 2.71 2.69 2.67 Q10 2.06 2.66 2.73 4.07 2.734.14 2.61 3.52 3.49 3.89 Q11 1.13 2.25 2.06 1.71 2.06 1.35 2.26 2.182.33 2.32 Q12 2.21 2.68 2.68 4.15 2.68 4.09 2.64 3.60 3.38 3.27 Q13 1.632.31 2.12 2.88 2.12 2.72 2.27 2.67 1.96 2.62 Q14 1.79 2.12 2.24 2.702.24 2.88 2.07 2.67 1.94 2.50 Q15 3.21 2.08 2.25 1.35 2.25 1.62 2.052.37 2.53 2.20 Q16 4.04 3.07 2.03 1.89 2.92 2.02 3.09 2.13 2.55 2.07 Q171.18 2.23 2.37 3.10 2.37 3.22 2.21 2.69 2.23 2.37 Q18 3.97 2.91 3.021.99 3.02 1.92 2.93 2.05 2.53 2.82 Q19 2.49 2.05 2.24 2.04 2.24 1.892.09 0.70 2.04 1.18 Q20 2.89 2.56 2.47 2.08 2.47 2.00 2.60 1.40 1.881.33 Q21 0.57 2.39 2.26 3.26 2.26 3.11 2.37 2.52 2.22 2.58 Q22 2.52 0.961.35 2.57 1.35 2.59 0.91 2.02 2.47 1.99 Q23 2.43 1.36 0.93 2.61 0.922.59 1.33 2.19 2.46 1.79 Q24 3.26 2.52 2.51 0.27 2.50 1.08 2.54 2.062.20 1.85 Q25 2.33 1.38 0.94 2.65 0.95 2.61 1.32 2.23 2.50 1.80 Q26 3.152.50 2.53 1.08 2.53 0.28 2.54 1.97 2.20 2.04 Q27 2.44 0.98 1.86 2.612.35 2.68 0.87 2.04 2.51 1.98 Q28 2.49 2.05 2.24 2.04 2.24 1.89 2.090.70 2.04 1.18 Q29 2.29 2.53 2.53 2.20 2.53 2.21 2.54 2.06 0.30 2.14 Q302.63 2.11 1.95 1.86 1.95 2.05 2.11 1.31 2.14 0.31

TABLE 10

| Method | Family TOL (Å) | TOL (Å) | Sealevel (% peak) | $\sum_{i=1}^{30} C_i$ | Unique $\sum_{i=1}^{30} C_i$ | Avg. Intracluster RMSD | Avg. Intercluster RMSD |
|---|---|---|---|---|---|---|---|
| O | 0.5 | 0.5 | 0 | 886 | 886 | 0.28 | 2.83 |
| O | 0.75 | 0.75 | 0 | 2364 | 2364 | 0.32 | 2.62 |
| G | 0.5 | 0.3 | 0.125 | 203 | 203 | 0.30 | 2.84 |
| G | 0.5 | 0.3 | 0.25 | 201 | 201 | 0.30 | 2.84 |
| G | 0.5 | 0.3 | 0.5 | 188 | 188 | 0.31 | 2.84 |
| G | 0.5 | 0.3 | 0.75 | 108 | 108 | 0.37 | 2.85 |
| G | 0.5 | 0.5 | 0.125 | 1913 | 1913 | 0.31 | 2.83 |
| G | 0.5 | 0.5 | 0.25 | 1197 | 1197 | 0.28 | 2.83 |
| G | 0.5 | 0.5 | 0.5 | 568 | 568 | 0.27 | 2.84 |
| G | 0.5 | 0.5 | 0.75 | 194 | 194 | 0.37 | 2.84 |
| G | 0.5 | 0.7 | 0.125 | 2038 | 1975 | 0.32 | 2.83 |
| G | 0.5 | 0.7 | 0.25 | 1199 | 1199 | 0.28 | 2.83 |
| G | 0.5 | 0.7 | 0.5 | 568 | 568 | 0.27 | 2.84 |
| G | 0.5 | 0.7 | 0.75 | 194 | 194 | 0.37 | 2.84 |
| G | 0.76 | 0.3 | 0.125 | 175 | 175 | 0.33 | 2.65 |
| G | 0.75 | 0.3 | 0.25 | 175 | 175 | 0.33 | 2.65 |
| G | 0.75 | 0.3 | 0.5 | 175 | 175 | 0.33 | 2.65 |
| G | 0.75 | 0.3 | 0.75 | 162 | 162 | 0.34 | 2.65 |
| G | 0.75 | 0.5 | 0.125 | 2759 | 2759 | 0.32 | 2.62 |
| G | 0.75 | 0.5 | 0.25 | 2516 | 2516 | 0.31 | 2.62 |
| G | 0.75 | 0.5 | 0.5 | 1601 | 1601 | 0.28 | 2.62 |
| G | 0.75 | 0.5 | 0.75 | 728 | 728 | 0.26 | 2.62 |
| G | 0.75 | 0.7 | 0.125 | 4067 | 3811 | 0.42 | 2.61 |
| G | 0.75 | 0.7 | 0.25 | 2916 | 2916 | 0.35 | 2.62 |
| G | 0.75 | 0.7 | 0.5 | 1621 | 1621 | 0.28 | 2.62 |
| G | 0.75 | 0.7 | 0.75 | 729 | 729 | 0.26 | 2.62 |

TABLE 11 Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Q9 Q10 Q11 Q12 Q13 Q14 Q15 Q16 Q17 Q18Q19 Size of Cluster 101 118 98 112 91 91 115 109 43 95 200 184 154 113109 97 113 188 270 Q20 Q21 Q22 Q23 Q24 Q25 Q26 Q27 Q28 Q29 Q30 Size ofCluster 104 113 125 122 125 137 256 198 156 140 143 Q1 Q2 Q3 Q4 Q5 Q6 Q7Q8 Q9 Q10 Q11 Q12 Q13 Q14 Q15 Q16 Q17 Q18 Q19 Q20 Q1 0.31 3.46 3.16 2.242.40 1.94 2.55 2.91 2.13 2.05 3.80 2.86 2.73 2.43 2.52 2.76 2.32 2.332.13 2.63 Q2 3.40 0.37 2.11 2.88 2.27 2.99 2.97 2.73 3.40 3.85 1.94 2.502.40 2.07 2.44 4.12 2.52 3.01 2.69 4.12 Q3 3.17 2.11 0.38 3.36 2.61 3.413.02 2.73 4.14 4.04 2.33 2.33 2.32 2.97 3.07 4.75 2.85 3.52 2.72 4.74 Q42.23 2.88 3.33 0.34 2.33 1.05 1.85 2.53 2.16 2.87 3.05 2.80 2.73 2.151.88 2.99 1.32 1.93 2.31 2.95 Q5 2.40 2.26 2.00 2.33 0.37 2.09 2.53 2.563.12 2.90 2.74 2.08 2.25 2.37 2.09 3.57 2.15 2.23 2.34 3.47 Q6 1.93 2.983.38 1.65 2.08 0.34 2.53 2.66 2.67 2.66 3.22 2.80 2.74 2.61 2.25 2.651.51 2.05 2.26 2.59 Q7 2.55 2.96 2.98 1.84 2.52 2.54 0.39 1.35 3.51 3.332.84 2.18 2.19 1.12 2.52 3.89 1.85 2.52 2.67 3.85 Q8 2.93 2.73 2.72 2.552.55 2.07 1.35 0.39 3.83 3.89 2.65 2.14 2.17 1.39 2.70 4.32 1.82 2.722.70 4.30 Q9 2.35 3.04 4.12 2.75 3.11 2.68 3.52 3.83 0.27 1.43 4.62 3.493.39 3.59 2.36 1.65 3.04 2.43 2.07 1.35 Q10 2.06 3.85 4.04 2.38 2.912.60 3.34 3.85 1.42 0.30 4.56 3.59 3.49 3.48 2.40 1.87 2.97 2.42 2.382.04 Q11 3.75 1.85 2.24 3.00 2.70 3.16 2.85 2.56 4.58 4.51 0.75 2.482.61 2.69 3.24 4.99 2.49 3.34 3.31 4.91 Q12 2.85 2.43 2.27 2.82 2.012.74 2.14 2.09 3.50 3.56 2.51 0.67 1.20 2.32 2.85 4.07 2.36 2.67 2.633.95 Q13 2.80 2.44 2.25 2.74 2.22 2.70 2.20 2.10 3.37 3.45 2.66 1.220.61 2.13 2.89 3.99 2.21 2.47 2.61 4.10 Q14 2.41 2.96 2.97 2.14 2.362.62 1.11 1.37 3.58 3.48 2.75 2.31 2.16 0.40 2.58 3.90 2.01 2.34 2.663.93 Q15 2.51 2.44 3.07 1.88 2.60 2.25 2.52 2.70 2.39 2.40 3.30 2.812.86 2.59 0.34 2.73 1.80 2.18 1.65 2.72 Q16 2.77 4.11 4.73 3.00 3.602.66 3.89 4.30 1.66 1.88 5.01 4.10 4.01 3.91 2.71 0.27 3.41 2.96 2.611.08 Q17 2.31 2.52 2.84 1.32 2.16 1.51 
1.86 1.82 3.04 3.00 2.57 2.402.24 2.00 1.80 3.41 0.36 1.92 2.01 3.40 Q18 2.29 2.98 3.48 1.93 2.162.00 2.51 2.67 2.42 2.45 3.36 2.68 2.47 2.31 2.18 2.94 1.84 0.65 2.192.86 Q19 2.11 2.61 2.06 2.20 2.32 2.20 2.62 2.64 2.18 2.54 3.36 2.602.58 2.77 1.53 2.53 1.94 2.17 0.74 2.71 Q20 2.64 4.11 4.72 2.96 3.482.60 3.36 4.29 1.35 2.08 4.95 3.99 4.12 3.93 2.70 1.08 3.41 2.90 2.710.27 Q21 2.42 2.52 2.83 1.69 2.39 1.84 1.98 1.84 3.16 3.07 2.42 2.202.28 1.88 1.78 3.42 1.08 2.13 2.07 3.41 Q22 2.40 2.86 3.32 1.10 2.481.96 2.07 2.51 2.94 2.00 2.86 2.05 2.76 1.85 1.91 2.99 1.59 2.11 2.373.03 Q23 3.16 2.19 2.02 2.77 2.49 3.02 2.25 2.14 3.55 3.83 2.41 1.781.55 2.20 2.81 4.22 2.26 2.79 2.53 4.24 Q24 1.34 3.26 2.95 2.70 2.352.47 3.00 3.06 1.98 2.07 3.91 2.71 2.70 3.03 2.20 2.51 2.59 2.51 1.892.52 Q25 3.12 2.17 2.06 2.92 2.37 3.02 2.22 2.13 3.67 3.93 2.26 1.581.73 2.41 2.79 4.24 2.31 2.86 2.46 4.20 Q26 2.11 2.61 2.65 2.27 2.322.21 2.63 2.64 2.18 2.55 3.37 2.00 2.56 2.78 1.53 2.57 1.95 2.18 0.722.73 Q27 2.37 2.81 2.71 2.33 1.44 1.90 2.60 2.90 2.85 2.47 2.98 1.992.36 2.48 2.14 2.99 2.24 2.33 1.91 2.81 Q28 2.22 2.34 2.69 2.35 1.042.18 2.47 2.84 2.87 2.76 2.80 2.14 1.92 2.54 2.25 2.83 2.29 2.49 2.073.00 Q29 2.36 3.16 3.80 2.44 3.00 4.18 3.37 3.49 2.12 2.03 4.08 3.273.14 3.27 1.95 2.02 2.82 1.87 2.28 1.77 Q30 2.53 3.10 3.80 2.20 2.841.90 3.22 3.47 2.16 2.13 4.12 3.18 3.29 3.41 1.94 1.79 2.67 2.16 2.292.02 Q21 Q22 Q23 Q24 Q25 Q26 Q27 Q28 Q29 Q30 Q1 2.43 2.40 3.14 1.33 3.122.13 2.41 2.21 2.55 2.53 Q2 2.52 2.86 2.21 3.25 2.19 2.68 2.34 2.33 3.163.15 Q3 2.84 3.33 2.02 2.94 2.06 2.71 2.78 2.70 3.85 3.86 Q4 1.70 1.102.78 2.76 2.92 2.30 2.27 2.33 2.43 2.20 Q5 2.38 2.47 2.49 2.35 2.37 2.351.48 1.63 2.99 2.83 Q6 1.86 1.96 3.01 2.43 2.98 2.26 2.00 2.18 2.17 1.89Q7 1.96 2.07 2.24 3.00 2.22 2.66 2.64 2.46 3.38 3.22 Q8 1.83 2.55 2.133.04 2.13 2.69 2.94 2.85 3.49 3.46 Q9 3.15 2.94 3.57 1.95 3.68 2.07 2.802.87 2.11 2.16 Q10 3.07 2.60 3.82 2.05 3.93 2.38 2.52 2.76 2.03 2.14 Q112.33 2.80 2.36 3.85 2.19 3.30 
2.91 2.73 4.10 4.14 Q12 2.15 2.64 1.722.66 1.49 2.63 1.98 2.06 3.21 3.07 Q13 2.23 2.76 1.51 2.65 1.68 2.602.26 1.86 3.08 3.24 Q14 1.87 1.84 2.21 3.03 2.40 2.66 2.52 2.52 3.263.42 Q15 1.78 1.93 2.83 2.20 2.80 1.64 2.20 2.27 2.00 1.94 Q16 3.43 3.004.23 2.52 4.25 2.62 3.00 2.85 2.04 1.79 Q17 1.09 1.58 2.26 2.58 2.312.01 2.28 2.30 2.83 2.66 Q18 2.08 2.07 2.80 2.44 2.86 2.19 2.37 2.441.79 2.09 Q19 2.01 2.40 2.50 1.81 2.38 0.72 1.96 2.14 2.20 2.20 Q20 3.413.03 4.25 2.53 4.21 2.71 2.86 3.00 1.77 2.02 Q21 0.39 1.29 2.29 2.572.21 2.06 2.43 2.23 2.68 2.83 Q22 1.31 0.33 2.92 2.78 2.76 2.36 2.342.24 2.24 2.45 Q23 2.28 2.92 0.41 3.02 1.11 2.52 2.48 2.25 3.40 3.47 Q242.59 2.77 3.03 0.32 3.02 1.89 2.29 2.24 2.26 2.24 Q25 2.20 2.75 1.113.01 0.38 2.45 2.32 2.43 3.47 3.40 Q26 2.02 2.41 2.48 1.80 2.38 0.721.97 2.16 2.21 2.21 Q27 2.44 2.30 2.47 2.22 2.30 1.91 0.56 1.12 2.532.30 Q28 2.23 2.25 2.26 2.24 2.47 2.07 1.21 0.34 2.28 2.47 Q29 2.69 2.243.42 2.25 3.48 2.28 2.50 2.29 0.29 1.09 Q30 2.82 2.46 3.48 2.23 3.412.29 2.27 2.47 1.08 0.31

TABLE 12

| Method | Family TOL (Å) | TOL (Å) | Sealevel (% peak) | $\sum_{i=1}^{30} C_i$ | Unique $\sum_{i=1}^{30} C_i$ | Avg. Intracluster RMSD | Avg. Intercluster RMSD |
|---|---|---|---|---|---|---|---|
| O | 0.5 | 0.5 | 0 | 414 | 407 | 0.32 | 2.73 |
| O | 0.75 | 0.75 | 0 | 1227 | 1227 | 0.34 | 2.60 |
| G | 0.5 | 0.3 | 0.125 | 75 | 75 | 0.22 | 2.74 |
| G | 0.5 | 0.3 | 0.25 | 75 | 75 | 0.22 | 2.74 |
| G | 0.5 | 0.3 | 0.5 | 73 | 73 | 0.22 | 2.74 |
| G | 0.5 | 0.3 | 0.75 | 57 | 57 | 0.18 | 2.74 |
| G | 0.5 | 0.5 | 0.125 | 799 | 774 | 0.31 | 2.73 |
| G | 0.5 | 0.5 | 0.25 | 626 | 607 | 0.30 | 2.73 |
| G | 0.5 | 0.5 | 0.5 | 336 | 328 | 0.31 | 2.74 |
| G | 0.5 | 0.5 | 0.75 | 132 | 131 | 0.35 | 2.73 |
| G | 0.5 | 0.7 | 0.125 | 1008 | 977 | 0.33 | 2.73 |
| G | 0.5 | 0.7 | 0.25 | 713 | 690 | 0.31 | 2.73 |
| G | 0.5 | 0.7 | 0.5 | 343 | 335 | 0.31 | 2.74 |
| G | 0.5 | 0.7 | 0.75 | 133 | 131 | 0.36 | 2.74 |
| G | 0.75 | 0.3 | 0.125 | 75 | 75 | 0.37 | 2.63 |
| G | 0.75 | 0.3 | 0.25 | 75 | 75 | 0.37 | 2.63 |
| G | 0.75 | 0.3 | 0.5 | 75 | 75 | 0.37 | 2.63 |
| G | 0.75 | 0.3 | 0.75 | 72 | 72 | 0.31 | 2.63 |
| G | 0.75 | 0.5 | 0.125 | 924 | 924 | 0.32 | 2.61 |
| G | 0.75 | 0.5 | 0.25 | 875 | 875 | 0.32 | 2.61 |
| G | 0.75 | 0.5 | 0.5 | 632 | 632 | 0.30 | 2.61 |
| G | 0.75 | 0.5 | 0.75 | 304 | 304 | 0.34 | 2.61 |
| G | 0.75 | 0.7 | 0.125 | 1874 | 1874 | 0.37 | 2.60 |
| G | 0.75 | 0.7 | 0.25 | 1429 | 1429 | 0.34 | 2.60 |
| G | 0.75 | 0.7 | 0.5 | 773 | 773 | 0.31 | 2.61 |
| G | 0.75 | 0.7 | 0.75 | 315 | 315 | 0.34 | 2.61 |
| G | 0.75 | 0.9 | 0.125 | 2042 | 1904 | 0.39 | 2.60 |
| G | 0.75 | 0.9 | 0.25 | 1431 | 1431 | 0.34 | 2.60 |
| G | 0.75 | 0.9 | 0.5 | 773 | 773 | 0.31 | 2.61 |
| G | 0.75 | 0.9 | 0.75 | 315 | 315 | 0.34 | 2.61 |

TABLE 13 Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Q9 Q10 Q11 Q12 Q13 Q14 Q15 Q16 Q17 Q18Q19 Size of Cluster 52 50 49 45 60 60 53 50 47 60 54 62 71 63 61 72 6463 138 Q20 Q21 Q22 Q23 Q24 Q25 Q26 Q27 Q28 Q29 Q30 Size of Cluster 56 5567 138 78 79 73 67 79 77 101 Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Q9 Q10 Q11 Q12 Q13Q14 Q15 Q16 Q17 Q18 Q19 Q20 Q1 0.38 1.11 2.67 2.31 2.36 3.10 2.90 2.932.85 2.29 2.61 2.55 2.87 2.15 2.71 2.20 1.60 2.63 2.39 1.87 Q2 1.13 0.342.53 2.50 2.26 2.94 3.04 2.72 2.92 2.37 2.71 2.60 2.94 2.34 2.70 2.371.84 2.51 2.29 2.16 Q3 2.67 2.51 0.41 3.65 2.28 2.82 2.79 2.03 3.08 2.573.22 2.72 2.17 2.97 1.92 2.54 2.74 2.14 2.66 3.07 Q4 2.31 2.51 3.05 0.322.48 4.28 3.53 3.91 2.03 3.04 2.02 2.76 3.91 2.43 3.20 3.32 2.49 3.563.28 1.74 Q5 2.36 2.25 2.28 2.48 0.59 3.07 2.43 2.64 2.67 2.48 2.34 2.272.60 2.18 1.72 2.48 1.96 2.39 2.51 2.01 Q6 3.10 2.94 2.83 4.27 3.07 0.373.14 2.56 4.05 2.73 3.90 3.59 2.05 2.93 3.09 2.42 2.70 2.64 2.63 3.89 Q72.91 3.03 2.79 3.53 2.44 3.14 0.41 2.62 3.09 2.33 3.03 2.23 2.15 3.072.32 2.22 2.72 2.57 2.47 2.73 Q8 2.94 2.71 2.64 3.91 2.65 2.56 2.60 0.374.00 2.83 3.50 2.83 1.89 3.19 2.69 2.80 2.79 1.97 2.82 3.18 Q9 2.87 2.933.98 2.05 2.67 4.66 3.10 4.01 0.34 3.35 2.37 2.78 4.10 2.79 3.30 3.502.88 3.86 3.53 2.28 Q10 2.32 2.38 2.59 3.04 2.50 2.73 2.34 2.84 3.340.36 2.02 2.35 2.43 2.34 2.27 1.31 2.25 2.61 1.46 2.36 Q11 2.60 2.683.23 2.02 2.82 3.88 3.05 3.40 2.37 2.90 0.37 2.83 3.44 2.03 2.97 3.192.60 3.30 8.08 2.39 Q12 2.55 2.62 2.72 2.75 2.20 3.59 2.22 2.83 2.792.34 2.84 0.40 2.88 2.79 2.02 2.29 2.54 2.60 2.48 2.17 Q13 2.87 2.942.17 3.90 2.60 2.06 2.15 1.88 4.10 2.41 3.44 2.88 0.43 2.96 2.56 1382.65 2.37 2.34 3.23 Q14 2.13 2.34 2.98 2.43 2.18 2.93 3.07 3.20 2.792.33 2.04 2.80 2.96 0.85 2.80 2.66 1.83 2.86 2.32 2.67 Q15 2.71 2.681.91 3.20 1.72 3.09 2.32 2.70 3.30 2.28 2.98 2.03 2.55 2.80 0.42 2.092.50 2.13 2.04 2.45 Q16 2.20 2.37 2.54 3.31 2.46 2.42 2.22 2.80 3.481.30 3.17 2.20 2.38 2.65 2.08 0.39 2.07 2.38 1.33 2.35 Q17 1.60 1.832.76 2.48 1.97 2.71 2.78 2.78 2.86 2.22 
2.65 2.50 2.66 1.83 2.50 2.080.38 2.43 1.85 3.12 Q18 2.03 2.51 2.13 3.55 2.39 2.64 2.56 1.97 3.842.50 3.29 2.65 2.36 2.88 2.13 2.38 2.43 0.39 2.33 3.15 Q19 2.37 2.252.03 3.22 2.50 2.74 2.41 2.77 3.53 1.34 2.95 2.51 2.28 2.44 1.95 1.211.70 2.39 0.73 2.45 Q20 1.88 2.15 3.07 1.74 2.02 3.89 2.72 3.19 2.272.36 2.39 2.17 3.24 2.66 2.47 2.36 2.12 3.13 2.52 0.87 Q21 2.14 1.903.25 1.97 2.19 3.74 2.93 3.07 2.47 2.58 2.28 1.94 3.12 2.60 2.43 2.602.32 3.00 2.38 1.11 Q22 2.43 2.30 2.18 2.94 1.90 3.14 2.41 3.01 2.882.21 2.09 2.19 2.88 2.58 1.33 2.44 2.35 2.73 2.20 2.48 Q23 2.36 2.252.63 3.22 2.50 2.72 2.41 2.77 3.53 1.35 2.96 2.51 2.28 2.44 1.95 1.201.75 2.38 0.73 2.44 Q24 2.75 2.77 2.93 2.15 2.12 3.82 2.71 3.17 2.272.95 1.30 2.58 3.07 2.30 2.63 3.02 2.63 3.10 2.89 2.41 Q25 2.79 2.662.76 2.30 1.86 3.75 2.74 3.34 2.05 2.83 1.04 2.50 3.24 2.02 2.64 2.882.49 8.24 2.99 2.22 Q26 1.87 1.62 2.65 2.37 2.09 2.50 2.78 2.64 2.051.98 2.55 2.37 2.69 2.05 2.51 1.75 1.10 2.18 2.03 2.85 Q27 2.96 2.944.08 2.13 3.05 4.95 3.86 4.70 1.80 3.77 2.42 3.37 4.40 2.92 3.00 3.003.10 4.39 3.80 2.81 Q28 2.37 2.17 3.20 2.13 2.09 3.89 2.90 2.99 2.002.70 2.12 1.70 3.13 2.41 2.76 2.99 3.40 2.98 2.74 1.60 Q29 2.21 2.323.06 1.93 1.91 3.89 2.85 3.19 1.09 2.53 2.18 2.10 3.33 2.48 2.78 2.882.24 3.17 2.90 1.31 Q30 2.90 2.59 2.86 3.12 2.28 3.04 1.35 2.03 2.502.40 2.73 2.39 2.30 2.74 2.15 2.37 2.31 2.96 2.35 2.64 Q21 Q22 Q28 Q24Q25 Q26 Q27 Q28 Q29 Q30 Q1 2.14 2.42 2.39 2.73 2.79 1.86 2.94 2.37 2.192.92 Q2 1.90 2.30 2.29 2.78 2.66 1.63 2.94 2.16 2.32 2.92 Q3 3.25 2.172.66 2.93 2.75 2.64 4.07 3.19 3.06 2.83 Q4 1.97 2.94 3.28 2.15 2.30 2.872.12 2.13 1.93 3.11 Q5 2.19 1.90 2.51 2.12 1.90 2.09 3.06 2.10 1.92 2.30Q6 3.75 3.14 2.63 3.86 3.75 2.50 4.95 3.91 3.88 3.05 Q7 2.92 2.40 2.472.70 2.73 2.76 3.85 2.90 2.85 1.34 Q8 3.06 3.00 2.82 3.17 3.34 2.65 4.722.98 3.19 2.64 Q9 2.48 2.89 3.53 2.28 2.06 2.96 1.87 2.01 1.71 2.83 Q102.60 2.22 1.46 2.95 2.84 1.93 3.76 2.70 2.52 2.37 Q11 2.28 2.69 3.031.30 1.63 2.52 2.42 2.11 2.17 2.65 
Q12 1.93 2.17 2.48 2.58 2.49 2.373.36 1.76 2.09 2.38 Q13 3.12 2.87 2.34 3.06 3.24 2.69 4.45 3.14 3.332.27 Q14 2.59 2.59 2.32 2.30 2.03 2.03 2.93 2.41 2.49 2.74 Q15 2.41 1.312.04 2.61 2.52 2.51 3.64 2.74 2.78 2.16 Q16 2.60 2.45 1.33 2.99 2.871.74 3.89 2.99 2.89 2.37 Q17 2.34 2.37 1.85 2.63 2.50 1.11 3.09 2.412.25 2.32 Q18 3.06 2.71 2.33 3.10 3.25 2.17 4.38 2.97 3.17 2.95 Q19 2.302.17 0.73 2.84 2.93 1.95 3.93 2.72 2.92 2.28 Q20 1.10 2.48 2.52 2.392.20 2.34 2.82 1.59 1.31 2.64 Q21 0.37 2.35 2.38 2.23 2.36 2.12 2.831.32 1.72 2.56 Q22 2.35 0.37 2.20 2.38 2.13 2.33 3.45 2.72 2.62 2.13 Q232.29 2.19 0.73 2.84 2.93 1.95 3.92 2.72 2.92 2.29 Q24 2.24 2.39 2.890.33 1.10 2.54 2.16 2.03 2.21 2.22 Q25 2.36 2.14 2.99 1.10 0.35 2.542.14 2.24 2.02 2.24 Q26 2.13 2.33 2.03 2.55 2.57 0.36 3.07 2.27 2.412.26 Q27 2.81 3.43 3.86 2.16 2.15 3.08 0.28 2.57 2.52 3.42 Q28 1.32 2.712.74 2.03 2.24 2.27 2.55 0.32 1.10 2.62 Q29 1.71 2.61 2.90 2.20 2.012.41 2.50 1.10 0.35 2.04 Q30 2.55 2.10 2.35 2.23 2.20 2.24 3.42 2.602.63 0.47

TABLE 14
H: α-helix
B: residue in isolated beta-bridge
E: extended strand, participates in beta ladder
G: 3-helix (3/10 helix)
I: 5 helix (π helix)
T: hydrogen bonded turn
S: bend
U: no assignment
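The tables that follow list, for each cluster, a value between 0 and 1 under each of the codes H, B, E, G, I, T, S and U. A minimal sketch of how such per-class fractions could be computed is given below. It assumes each cluster member is available as a string of the one-letter codes of Table 14, one letter per residue; the function name class_fractions and the example strings are hypothetical and are not taken from the specification.

```python
from collections import Counter

# One-letter secondary-structure codes as listed in Table 14.
CLASSES = "HBEGITSU"


def class_fractions(motif_strings):
    """Return the fraction of residues assigned to each code.

    motif_strings: iterable of strings such as "HHHT", one per cluster
    member (an assumed input format, not one given in the source).
    """
    counts = Counter()
    for motif in motif_strings:
        for code in motif:
            if code not in CLASSES:
                raise ValueError(f"unknown code: {code!r}")
            counts[code] += 1
    total = sum(counts.values())
    if total == 0:
        return {code: 0.0 for code in CLASSES}
    return {code: counts[code] / total for code in CLASSES}


# Hypothetical cluster of three 4-residue motifs.
example_cluster = ["HHHH", "HHHT", "HHHU"]
fractions = class_fractions(example_cluster)
for code in CLASSES:
    print(code, round(fractions[code], 2))
```

Rounding each fraction to two decimal places independently, as in the printout above, can make a row sum to slightly more or less than 1.00.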

TABLE 15
Cluster  |C_i|  H     B     E     G     I     T     S     U
C1       616    0.96  0.00  0.00  0.00  0.00  0.02  0.01  0.01
C2       682    0.01  0.03  0.56  0.01  0.00  0.03  0.05  0.32
C3       278    0.80  0.00  0.00  0.06  0.00  0.09  0.01  0.03
C4       258    0.04  0.00  0.00  0.00  0.00  0.03  0.00  0.02
C5       609    0.96  0.00  0.00  0.00  0.00  0.02  0.01  0.01
C6       1809   0.96  0.00  0.00  0.00  0.00  0.03  0.00  0.01
C7       255    0.92  0.00  0.00  0.00  0.00  0.06  0.01  0.01
C8       264    0.94  0.00  0.00  0.01  0.00  0.03  0.01  0.01
C9       300    0.97  0.00  0.00  0.00  0.00  0.02  0.00  0.01
C10      274    0.95  0.00  0.00  0.00  0.00  0.03  0.00  0.01
C11      266    0.93  0.00  0.00  0.01  0.00  0.04  0.01  0.01
C12      423    0.95  0.00  0.00  0.00  0.00  0.02  0.01  0.01
C13      303    0.97  0.00  0.00  0.00  0.00  0.01  0.01  0.01
C14      1750   0.96  0.00  0.00  0.00  0.00  0.02  0.00  0.01
C15      310    0.96  0.00  0.00  0.00  0.00  0.03  0.00  0.01
C16      304    0.96  0.00  0.00  0.00  0.00  0.02  0.00  0.01
C17      476    0.96  0.00  0.00  0.00  0.00  0.02  0.01  0.01
C18      1262   0.96  0.00  0.00  0.00  0.00  0.02  0.00  0.01
C19      345    0.96  0.00  0.00  0.00  0.00  0.02  0.01  0.01
C20      340    0.97  0.00  0.00  0.00  0.00  0.02  0.00  0.01
C21      944    0.96  0.00  0.00  0.00  0.00  0.02  0.01  0.01
C22      1667   0.97  0.00  0.00  0.00  0.00  0.02  0.00  0.01
C23      1648   0.97  0.00  0.00  0.00  0.00  0.02  0.00  0.01
C24      386    0.96  0.00  0.00  0.00  0.00  0.02  0.01  0.01
C25      882    0.96  0.00  0.00  0.00  0.00  0.02  0.00  0.01
C26      428    0.96  0.00  0.00  0.00  0.00  0.02  0.00  0.01
C27      855    0.96  0.00  0.00  0.00  0.00  0.02  0.00  0.01
C28      473    0.96  0.00  0.00  0.00  0.00  0.02  0.00  0.01
C29      515    0.91  0.00  0.00  0.00  0.00  0.03  0.01  0.01
C30      861    0.97  0.00  0.00  0.00  0.00  0.02  0.00  0.01

TABLE 16
Cluster  |C_i|  H     B     E     G     I     T     S     U
C1       298    0.97  0.00  0.00  0.00  0.00  0.02  0.01  0.01
C2       238    0.97  0.00  0.00  0.00  0.00  0.02  0.00  0.01
C3       821    0.96  0.00  0.00  0.00  0.00  0.02  0.00  0.01
C4       309    0.97  0.00  0.00  0.00  0.00  0.02  0.01  0.00
C5       245    0.97  0.00  0.00  0.00  0.00  0.02  0.00  0.00
C6       174    0.97  0.00  0.00  0.00  0.00  0.02  0.00  0.01
C7       147    0.96  0.00  0.00  0.00  0.00  0.03  0.01  0.00
C8       168    0.97  0.00  0.00  0.00  0.00  0.02  0.00  0.01
C9       178    0.97  0.00  0.00  0.00  0.00  0.02  0.00  0.01
C10      296    0.98  0.00  0.00  0.00  0.00  0.01  0.00  0.01
C11      191    0.97  0.00  0.00  0.00  0.00  0.01  0.00  0.01
C12      587    0.97  0.00  0.00  0.00  0.00  0.02  0.00  0.00
C13      305    0.97  0.00  0.00  0.00  0.00  0.02  0.00  0.01
C14      281    0.98  0.00  0.00  0.00  0.00  0.02  0.00  0.00
C15      185    0.97  0.00  0.00  0.00  0.00  0.02  0.00  0.01
C16      182    0.95  0.00  0.00  0.00  0.00  0.04  0.00  0.01
C17      270    0.97  0.00  0.00  0.00  0.00  0.02  0.00  0.01
C18      185    0.97  0.00  0.00  0.00  0.00  0.02  0.00  0.01
C19      519    0.96  0.00  0.00  0.00  0.00  0.02  0.01  0.01
C20      196    0.97  0.00  0.00  0.00  0.00  0.02  0.00  0.01
C21      308    0.98  0.00  0.00  0.00  0.00  0.01  0.00  0.01
C22      772    0.97  0.00  0.00  0.00  0.00  0.02  0.00  0.01
C23      716    0.97  0.00  0.00  0.00  0.00  0.02  0.00  0.01
C24      232    0.96  0.00  0.00  0.00  0.00  0.02  0.00  0.01
C25      702    0.97  0.00  0.00  0.00  0.00  0.02  0.00  0.01
C26      239    0.97  0.00  0.00  0.00  0.00  0.02  0.00  0.01
C27      650    0.97  0.00  0.00  0.00  0.00  0.02  0.00  0.01
C28      491    0.97  0.00  0.00  0.00  0.00  0.02  0.00  0.01
C29      284    0.97  0.00  0.00  0.00  0.00  0.01  0.00  0.01
C30      298    0.97  0.00  0.00  0.00  0.00  0.02  0.00  0.00

TABLE 17
Cluster  |C_i|  H     B     E     G     I     T     S     U
C1       101    0.97  0.00  0.00  0.00  0.00  0.02  0.00  0.01
C2       113    0.97  0.00  0.00  0.00  0.00  0.02  0.00  0.00
C3       98     0.98  0.00  0.00  0.00  0.00  0.01  0.01  0.01
C4       112    0.97  0.00  0.00  0.00  0.00  0.02  0.00  0.01
C5       91     0.98  0.00  0.00  0.00  0.00  0.01  0.00  0.01
C6       91     0.98  0.00  0.00  0.00  0.00  0.01  0.00  0.01
C7       115    0.98  0.00  0.00  0.00  0.00  0.01  0.00  0.00
C8       109    0.98  0.00  0.00  0.00  0.00  0.01  0.00  0.01
C9       93     0.97  0.00  0.00  0.00  0.00  0.02  0.00  0.01
C10      95     0.96  0.00  0.00  0.00  0.00  0.03  0.01  0.01
C11      200    0.98  0.00  0.00  0.00  0.00  0.01  0.00  0.01
C12      184    0.98  0.00  0.00  0.00  0.00  0.02  0.00  0.00
C13      154    0.98  0.00  0.00  0.00  0.00  0.02  0.00  0.00
C14      113    0.97  0.00  0.00  0.00  0.00  0.01  0.01  0.01
C15      109    0.98  0.00  0.00  0.00  0.00  0.02  0.00  0.01
C16      97     0.97  0.00  0.00  0.00  0.00  0.02  0.00  0.01
C17      115    0.97  0.00  0.00  0.00  0.00  0.03  0.00  0.01
C18      188    0.98  0.00  0.00  0.00  0.00  0.02  0.00  0.00
C19      270    0.97  0.00  0.00  0.00  0.00  0.02  0.00  0.00
C20      104    0.96  0.00  0.00  0.00  0.00  0.02  0.00  0.01
C21      113    0.98  0.00  0.00  0.00  0.00  0.01  0.00  0.01
C22      125    0.97  0.00  0.00  0.00  0.00  0.02  0.01  0.01
C23      122    0.98  0.00  0.00  0.00  0.00  0.01  0.00  0.01
C24      125    0.97  0.00  0.00  0.00  0.00  0.02  0.00  0.01
C25      137    0.98  0.00  0.00  0.00  0.00  0.01  0.01  0.00
C26      256    0.98  0.00  0.00  0.00  0.00  0.01  0.00  0.00
C27      198    0.98  0.00  0.00  0.00  0.00  0.01  0.00  0.01
C28      156    0.98  0.00  0.00  0.00  0.00  0.01  0.00  0.01
C29      140    0.98  0.00  0.00  0.00  0.00  0.01  0.00  0.00
C30      143    0.98  0.00  0.00  0.00  0.00  0.01  0.00  0.01

TABLE 18
Cluster  |C_i|  H     B     E     G     I     T     S     U
C1       52     0.97  0.00  0.00  0.01  0.00  0.01  0.00  0.01
C2       50     0.98  0.00  0.00  0.00  0.00  0.01  0.00  0.01
C3       49     0.99  0.00  0.00  0.00  0.00  0.01  0.00  0.01
C4       45     0.98  0.00  0.00  0.00  0.00  0.01  0.00  0.01
C5       60     0.98  0.00  0.00  0.00  0.00  0.02  0.00  0.00
C6       60     0.99  0.00  0.00  0.00  0.00  0.01  0.00  0.00
C7       53     0.99  0.00  0.00  0.00  0.00  0.01  0.00  0.00
C8       50     0.99  0.00  0.00  0.00  0.00  0.01  0.00  0.01
C9       47     0.97  0.00  0.00  0.00  0.00  0.02  0.01  0.00
C10      60     1.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
C11      54     0.98  0.00  0.00  0.00  0.00  0.02  0.00  0.00
C12      62     0.97  0.00  0.00  0.00  0.00  0.02  0.01  0.00
C13      71     0.99  0.00  0.00  0.00  0.00  0.01  0.00  0.01
C14      61     0.98  0.00  0.00  0.00  0.00  0.02  0.00  0.00
C15      61     0.99  0.00  0.00  0.00  0.00  0.00  0.00  0.01
C16      72     0.98  0.00  0.00  0.00  0.00  0.01  0.00  0.01
C17      64     0.97  0.00  0.00  0.00  0.00  0.03  0.00  0.01
C18      63     0.98  0.00  0.00  0.00  0.00  0.01  0.00  0.00
C19      138    0.99  0.00  0.00  0.00  0.00  0.01  0.00  0.01
C20      56     0.98  0.00  0.00  0.00  0.00  0.01  0.00  0.01
C21      55     0.98  0.00  0.00  0.00  0.00  0.01  0.00  0.01
C22      67     0.97  0.00  0.00  0.00  0.00  0.01  0.01  0.00
C23      138    0.99  0.00  0.00  0.00  0.00  0.01  0.00  0.01
C24      78     0.98  0.00  0.00  0.00  0.00  0.01  0.01  0.01
C25      79     0.98  0.00  0.00  0.00  0.00  0.02  0.00  0.00
C26      73     0.97  0.00  0.00  0.00  0.00  0.02  0.00  0.01
C27      67     0.98  0.00  0.00  0.00  0.00  0.01  0.00  0.01
C28      79     0.97  0.00  0.00  0.00  0.00  0.02  0.01  0.01
C29      77     0.99  0.00  0.00  0.00  0.00  0.01  0.00  0.01
C30      101    0.98  0.00  0.00  0.00  0.00  0.01  0.00  0.00

TABLE 19
         Motif size
Cluster  4     5     6     7
C1       0.08  0.07  0.09  0.06
C2       0.99  0.03  0.04  0.04
C3       0.21  0.08  0.05  0.04
C4       0.07  0.06  0.04  0.00
C5       0.08  0.06  0.03  0.03
C6       0.07  0.03  0.02  0.00
C7       0.09  0.05  0.04  0.02
C8       0.05  0.04  0.05  0.02
C9       0.06  0.05  0.04  0.09
C10      0.05  0.04  0.05  0.00
C11      0.06  0.07  0.04  0.09
C12      0.08  0.06  0.06  0.10
C13      0.06  0.07  0.05  0.03
C14      0.06  0.02  0.06  0.00
C15      0.03  0.03  0.02  0.08
C16      0.03  0.05  0.05  0.03
C17      0.06  0.04  0.04  0.06
C18      0.05  0.02  0.03  0.05
C19      0.06  0.07  0.07  0.03
C20      0.05  0.03  0.05  0.05
C21      0.06  0.04  0.05  0.07
C22      0.06  0.07  0.07  0.07
C23      0.05  0.07  0.05  0.03
C24      0.03  0.03  0.09  0.05
C25      0.05  0.06  0.05  0.06
C26      0.02  0.07  0.06  0.04
C27      0.05  0.06  0.03  0.03
C28      0.03  0.06  0.04  0.09
C29      0.04  0.02  0.03  0.03
C30      0.05  0.02  0.02  0.05

TABLE 20
         Size of motif
Cluster  4     5      6       7
C1       UTBU  UUUUU  UUUUUU  HHHHHHS
C2       UUUU  UUUUU  UUUUUU  HHHHHHT
C3       UUUU  UUUUU  UUUUUU  HHHHHHS
C4       UUUU  UUUUU  UUUUUU  HHHHHHU
C5       HHHU  UUUUU  UUUUUU  UUUUUUU
C6       EEEE  UUUUU  UUUUUU  UUUUUUU
C7       UUUH  UUUUU  UUUUUU  UUUUUUU
C8       UTUH  UUUUU  UUUUUU  UUUUUUU
C9       SUUU  UUUUU  UUUUUU  UUUUUUU
C10      UUUU  UUUUU  UUUUUU  UUUUUUU
C11      UUUB  UUUUU  UUUUUU  UUUUUUU
C12      UUUU  UUUUU  UUUUUU  UUUUUUU
C13      UUEE  UUUUU  UUUUUU  UUUUUUU
C14      EUUG  UUUUU  UUUUUU  UUUUUUU
C15      UUUU  UUUUU  UUUUUU  UUUUUUU
C16      UUUU  UUUUU  UUUUUU  UUUUUUU
C17      HHHS  HHHHU  UUUUUU  UUUUUUU
C18      EEEE  UUUUU  UUUUUU  UUUUUUU
C19      GGSU  HHHHS  UUUUUU  UUUUUUU
C20      EEEU  UUUUU  UUUUUU  UUUUUUU
C21      EEEE  UUUUU  HHHHHU  UUUUUUU
C22      EEEE  UUUUU  UUUUUU  UUUUUUU
C23      EEEE  HHHHS  UUUUUU  UUUUUUU
C24      SEEE  EEEEE  UUUUUU  UUUUUUU
C25      HHHS  EEEEE  UUUUUU  UUUUUUU
C26      EEEE  HHHHU  UUUUUU  UUUUUUU
C27      EEEE  EEEEE  UUUUUU  UUUUUUU
C28      EEES  EEEEE  HHHHHU  UUUUUUU
C29      EEEE  HHHHU  HHHHHU  UUUUUUU
C30      EEEE  EEEEE  EEEEEE  UUUUUUU

TABLE 21
Method  Size of motif  TOL   Sealevel (% peak)  $\sum_{i=1}^{30} C_{i}$  Unique $\sum_{i=1}^{30} C_{i}$  Avg. intracluster RMSD  Avg. intercluster RMSD
O       4              0.75  0                  1625                     1573                            0.54                    2.69
G       4              0.5   0.25               6671                     3770                            0.52                    2.69
O       5              0.75  0                  712                      712                             0.29                    2.72
G       5              0.7   0.125              1264                     1264                            0.37                    2.72
G       5              0.9   0.125              2298                     2298                            0.43                    2.74
O       6              0.75  0                  591                      591                             0.25                    2.50
G       6              0.7   0.125              680                      680                             0.26                    2.50
G       6              0.9   0.125              771                      771                             0.27                    2.49
G       6              1.1   0.125              771                      771                             0.27                    2.49
O       7              0.75  0                  479                      479                             0.26                    2.33
G       7              0.9   0.125              532                      532                             0.26                    2.33
G       7              1.1   0.125              532                      632                             0.26                    2.33
Legend
O: One-pass algorithm
G: Greedy algorithm

TABLE 22 Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Q9 Q10 Q11 Q12 Q13 Q14 Q15 Q16 Q17 Q18Q19 Size of Cluster 8 20 20 19 38 55 23 3 1 19 2 20 110 474 31 26 50 991687 Q20 Q21 Q22 Q23 Q24 Q25 Q26 Q27 Q28 Q29 Q30 Size of Cluster 136 111102 107 1649 79 236 228 342 299 682 Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Q9 Q10 Q11Q12 Q13 Q14 Q15 Q16 Q17 Q18 Q19 Q20 Q1 0.50 3.29 2.50 2.93 3.05 1.533.68 1.04 1.20 2.52 3.48 1.92 4.12 3.29 2.09 3.12 1.37 2.47 4.13 3.28 Q23.32 0.20 1.70 2.61 2.38 3.14 4.13 3.60 5.19 1.98 3.87 2.88 5.05 3.704.02 2.77 3.24 3.17 4.96 2.31 Q3 2.53 1.70 0.38 1.78 1.57 2.25 3.30 2.993.99 2.04 2.85 1.80 3.90 2.50 2.88 2.05 2.68 2.18 3.82 1.64 Q4 2.93 2.651.80 0.30 2.45 2.70 2.55 3.80 3.30 3.10 2.21 2.89 3.34 2.05 2.68 1.572.86 1.52 3.23 1.56 Q5 3.03 2.35 1.52 2.39 0.41 2.43 3.66 3.08 1.37 2.303.28 1.84 4.15 2.98 3.02 2.79 3.11 2.54 4.03 2.38 Q6 1.51 3.12 2.22 2.712.46 0.49 3.27 2.18 3.80 2.80 2.99 2.28 3.69 2.94 2.26 2.90 1.62 2.423.66 2.96 Q7 3.68 4.12 3.29 2.59 3.66 3.28 0.24 4.80 1.57 4.61 0.03 3.831.65 1.66 1.78 1.70 3.63 2.34 1.64 2.40 Q8 1.67 3.55 2.95 3.81 3.08 2.164.89 0.51 5.50 2.18 4.68 1.96 5.33 4.40 1.04 4.27 2.08 3.47 5.31 4.12 Q91.24 5.19 4.00 3.39 4.40 3.83 1.62 5.50 0.00 5.51 1.61 4.48 0.86 1.951.91 2.79 4.30 3.03 1.21 3.30 Q10 2.51 1.94 2.04 3.12 2.37 2.76 4.602.25 5.49 0.35 4.43 1.07 5.29 4.02 4.15 3.61 2.53 3.30 5.19 3.15 Q113.42 4.17 3.07 2.28 3.44 2.94 1.12 4.03 1.43 4.58 0.83 3.82 1.45 1.311.31 1.94 3.46 1.91 1.41 2.41 Q12 1.97 2.88 1.82 2.86 1.89 2.23 3.842.03 1.48 1.66 3.76 0.15 4.33 3.27 3.30 3.24 2.03 2.64 4.28 2.92 Q134.11 5.08 3.92 3.40 4.21 3.67 1.61 5.32 0.66 5.37 1.57 4.35 0.57 1.841.70 2.79 4.24 2.87 1.11 3.20 Q14 3.26 3.04 2.43 1.94 2.90 2.83 1.584.37 1.82 4.00 0.99 3.24 1.72 0.72 1.57 1.54 3.40 1.82 1.67 1.82 Q152.08 4.01 2.86 2.70 3.04 2.28 1.76 4.03 1.84 4.18 1.41 3.28 1.74 1.070.43 2.35 3.11 2.18 1.74 2.65 Q16 3.13 2.74 2.00 1.52 2.76 2.90 1.694.22 2.74 3.55 1.51 3.26 2.77 1.61 2.34 0.38 3.10 1.83 2.70 1.37 Q171.38 3.27 2.72 2.92 3.12 1.65 3.04 2.08 1.29 
2.56 3.51 2.03 4.26 3.473.12 3.11 0.35 2.74 4.24 3.37 Q18 2.45 3.17 2.10 1.52 2.56 2.40 2.323.46 2.98 3.27 1.06 2.64 2.88 1.87 2.16 1.86 2.72 0.57 2.87 2.04 Q194.14 4.96 3.88 3.35 4.18 3.70 1.62 3.38 0.96 5.21 1.57 4.19 1.06 1.781.81 2.76 4.19 3.09 1.23 3.15 Q20 3.27 2.22 1.51 1.38 2.32 2.92 2.364.12 3.23 3.13 2.01 2.92 3.15 1.81 2.50 1.19 3.84 1.07 3.06 0.70 Q211.94 2.77 1.78 1.94 2.31 2.09 2.85 2.72 3.53 2.17 2.68 1.49 3.39 2.262.51 2.21 2.12 1.84 3.30 2.12 Q22 3.25 2.16 1.60 1.50 2.33 2.87 2.294.09 3.30 3.16 1.96 2.96 3.20 1.93 2.55 1.18 3.31 1.93 3.11 0.81 Q231.84 2.74 1.85 1.89 2.45 2.00 2.75 2.74 3.45 2.25 2.51 1.77 3.32 2.252.41 2.00 2.12 1.74 3.26 2.10 Q24 4.08 4.93 3.79 3.19 4.12 3.59 1.505.30 0.86 5.25 1.32 4.40 0.86 1.07 1.59 2.05 4.18 2.82 0.86 3.08 Q253.30 3.41 2.67 2.01 2.87 2.74 1.62 4.26 2.71 4.16 1.56 3.83 2.63 2.051.98 1.79 3.12 1.92 2.64 2.13 Q26 2.49 4.15 2.85 2.71 2.98 2.61 2.363.47 2.78 3.49 2.52 2.28 2.57 2.05 2.13 2.67 2.63 2.19 2.54 2.89 Q272.30 4.15 2.86 2.77 3.01 2.63 2.40 3.48 2.78 3.48 2.53 2.30 2.58 2.082.15 2.68 2.63 2.19 2.55 2.91 Q28 3.25 3.59 2.36 1.90 2.38 2.88 1.654.35 1.90 3.94 1.04 3.18 1.79 0.72 1.61 1.51 3.41 1.80 1.74 1.76 Q293.19 3.55 2.38 1.08 2.91 2.88 1.68 4.33 2.01 3.88 1.09 3.16 1.82 0.801.62 1.51 3.36 1.70 1.77 1.80 Q30 4.04 4.83 3.69 3.08 3.98 3.52 1.485.22 0.94 5.10 1.31 4.31 0.91 1.60 1.51 2.60 4.16 2.72 0.87 2.96 Q21 Q22Q23 Q24 Q25 Q26 Q27 Q28 Q29 Q30 Q1 1.99 3.25 1.86 4.12 3.28 2.53 2.533.27 8.22 4.05 Q2 2.81 2.22 2.78 4.94 3.42 4.07 4.08 3.63 8.60 4.85 Q31.78 1.70 1.86 3.82 2.74 2.82 2.82 2.42 2.45 3.71 Q4 2.01 1.60 1.94 3.231.98 2.71 2.72 1.99 2.08 3.10 Q5 2.30 2.37 2.44 4.03 2.85 2.96 2.97 2.912.94 3.96 Q6 2.13 2.89 2.02 3.66 2.73 2.60 2.61 2.91 2.92 3.54 Q7 2.872.31 2.75 1.68 1.60 2.37 2.87 1.69 1.74 1.53 Q8 2.74 4.09 2.75 5.31 4.273.49 3.49 4.37 4.34 5.23 Q9 3.54 3.35 3.48 1.20 2.73 2.72 2.72 1.99 2.101.12 Q10 2.24 3.17 2.82 5.19 4.18 3.52 3.52 3.96 3.89 5.14 Q11 2.74 2.372.59 1.40 1.70 2.32 2.33 1.32 1.42 
1.19 Q12 1.57 2.97 1.83 4.28 3.842.36 2.86 3.22 3.19 4.26 Q13 3.45 3.21 3.35 1.11 2.59 2.57 2.58 1.881.91 0.95 Q14 2.23 1.91 2.20 1.66 2.04 2.01 2.01 0.65 0.76 1.52 Q15 2.562.58 2.43 1.73 1.96 2.04 2.05 1.68 1.72 1.54 Q16 2.24 1.21 2.08 2.701.77 2.66 2.68 1.55 1.56 2.01 Q17 2.17 3.35 2.15 4.24 3.12 2.68 2.673.45 3.41 4.19 Q18 1.89 1.96 1.75 2.86 1.87 2.21 2.21 1.82 1.82 2.72 Q193.82 3.31 3.31 0.99 2.69 2.42 2.41 1.82 1.84 0.97 Q20 2.12 0.73 2.083.06 2.11 2.87 2.87 1.73 1.79 2.93 Q21 0.61 2.23 0.79 3.30 2.68 1.651.65 2.22 2.18 3.24 Q22 2.23 0.57 2.11 3.11 1.89 2.96 2.97 1.86 1.892.97 Q23 0.78 2.11 0.63 3.26 2.06 1.77 1.77 2.21 2.17 3.18 Q24 3.34 3.093.26 0.89 2.44 2.55 2.55 1.71 1.76 0.66 Q25 2.69 1.93 2.60 2.53 0.312.61 2.62 2.05 2.10 2.38 Q26 1.65 2.99 1.79 2.54 2.66 0.65 0.63 2.052.00 2.57 Q27 1.69 3.00 1.79 2.55 2.69 0.66 0.65 2.08 2.00 2.58 Q28 2.211.86 2.18 1.73 2.06 2.02 2.02 0.64 0.75 1.59 Q29 2.18 1.87 2.15 1.762.10 1.97 1.97 0.73 0.65 1.65 Q30 3.26 2.98 3.18 0.87 2.34 2.53 2.531.63 1.71 0.61

TABLE 23
Cluster  |C_i|  H     B     E     G     I     T     S     U
C1       3      0.67  0.08  0.00  0.00  0.00  0.08  0.00  0.17
C2       20     0.00  0.00  0.00  0.00  0.00  0.00  0.00  1.00
C3       20     0.00  0.00  0.00  0.00  0.00  0.00  0.00  1.00
C4       19     0.00  0.00  0.00  0.00  0.00  0.00  0.00  1.00
C5       38     0.93  0.00  0.00  0.01  0.00  0.03  0.01  0.02
C6       55     0.01  0.01  0.83  0.00  0.00  0.00  0.03  0.12
C7       23     0.01  0.00  0.01  0.00  0.00  0.00  0.00  0.98
C8       3      0.25  0.00  0.00  0.17  0.00  0.25  0.00  0.33
C9       1      0.00  0.00  0.00  0.00  0.00  0.25  0.00  0.75
C10      19     0.00  0.00  0.00  0.00  0.00  0.00  0.00  1.00
C11      2      0.00  0.12  0.00  0.00  0.00  0.00  0.00  0.88
C12      20     0.00  0.00  0.00  0.00  0.00  0.00  0.00  1.00
C13      110    0.00  0.03  0.50  0.00  0.00  0.08  0.07  0.30
C14      474    0.01  0.02  0.63  0.01  0.00  0.02  0.04  0.26
C15      31     0.01  0.05  0.04  0.02  0.00  0.02  0.02  0.85
C16      26     0.01  0.01  0.01  0.00  0.00  0.00  0.01  0.96
C17      50     0.95  0.00  0.00  0.00  0.00  0.03  0.01  0.02
C18      99     0.01  0.00  0.81  0.00  0.00  0.09  0.03  0.16
C19      1687   0.03  0.03  0.39  0.02  0.00  0.06  0.11  0.35
C20      136    0.01  0.01  0.74  0.00  0.00  0.01  0.04  0.19
C21      111    0.00  0.00  0.80  0.00  0.00  0.00  0.02  0.16
C22      102    0.01  0.01  0.73  0.00  0.00  0.01  0.03  0.20
C23      107    0.00  0.00  0.82  0.00  0.00  0.01  0.03  0.12
C24      1649   0.07  0.03  0.35  0.02  0.00  0.07  0.10  0.35
C25      79     0.94  0.00  0.00  0.00  0.00  0.04  0.00  0.01
C26      236    0.02  0.01  0.73  0.01  0.00  0.01  0.02  0.19
C27      228    0.02  0.01  0.73  0.01  0.00  0.01  0.03  0.19
C28      342    0.00  0.02  0.69  0.01  0.00  0.02  0.03  0.24
C29      299    0.00  0.01  0.71  0.00  0.00  0.01  0.03  0.23
C30      652    0.01  0.03  0.56  0.01  0.00  0.03  0.05  0.32

TABLE 24
         Motif size
Cluster  4     5     6     7
C1       0.33  1.00  1.00  0.14
C2       1.00  1.00  1.00  0.14
C3       1.00  1.00  1.00  0.14
C4       1.00  1.00  1.00  1.00
C5       0.16  1.00  1.00  1.00
C6       1.00  1.00  1.00  1.00
C7       1.00  1.00  1.00  1.00
C8       1.00  1.00  1.00  1.00
C9       1.00  1.00  1.00  1.00
C10      1.00  1.00  1.00  1.00
C11      1.00  1.00  1.00  1.00
C12      1.00  1.00  1.00  1.00
C13      0.99  1.00  1.00  1.00
C14      0.99  1.00  1.00  1.00
C15      1.00  1.00  1.00  1.00
C16      1.00  1.00  1.00  1.00
C17      0.12  0.17  1.00  1.00
C18      1.00  1.00  1.00  1.00
C19      0.96  0.08  1.00  1.00
C20      1.00  1.00  1.00  1.00
C21      1.00  1.00  0.15  1.00
C22      0.99  1.00  1.00  1.00
C23      1.00  0.08  1.00  1.00
C24      0.92  0.99  1.00  1.00
C25      0.11  1.00  1.00  1.00
C26      0.99  0.13  1.00  1.00
C27      0.99  1.00  1.00  1.00
C28      1.00  1.00  0.13  1.00
C29      0.99  0.18  0.11  1.00
C30      0.99  0.99  1.00  1.00

TABLE 25
Cluster  |C_i|  H     B     E     G     I     T     S     U
C1       28     0.97  0.00  0.00  0.00  0.00  0.02  0.01  0.01
C2       29     0.96  0.00  0.00  0.00  0.00  0.02  0.00  0.01
C3       35     0.97  0.00  0.00  0.00  0.00  0.01  0.01  0.01
C4       14     0.00  0.00  0.00  0.00  0.00  0.00  0.00  1.00
C5       14     0.00  0.00  0.00  0.00  0.00  0.00  0.00  1.00
C6       14     0.00  0.00  0.00  0.00  0.00  0.00  0.00  1.00
C7       14     0.00  0.00  0.00  0.00  0.00  0.00  0.00  1.00
C8       14     0.00  0.00  0.00  0.00  0.00  0.00  0.00  1.00
C9       15     0.00  0.00  0.00  0.00  0.00  0.00  0.00  1.00
C10      15     0.00  0.00  0.00  0.00  0.00  0.00  0.00  1.00
C11      15     0.00  0.00  0.00  0.00  0.00  0.00  0.00  1.00
C12      15     0.00  0.00  0.00  0.00  0.00  0.00  0.00  1.00
C13      15     0.00  0.00  0.00  0.00  0.00  0.00  0.00  1.00
C14      15     0.00  0.00  0.00  0.00  0.00  0.00  0.00  1.00
C15      15     0.00  0.00  0.00  0.00  0.00  0.00  0.00  1.00
C16      15     0.00  0.00  0.00  0.00  0.00  0.00  0.00  1.00
C17      15     0.00  0.00  0.00  0.00  0.00  0.00  0.00  1.00
C18      15     0.00  0.00  0.00  0.00  0.00  0.00  0.00  1.00
C19      17     0.00  0.00  0.00  0.00  0.00  0.00  0.00  1.00
C20      18     0.00  0.00  0.00  0.00  0.00  0.00  0.00  1.00
C21      18     0.00  0.00  0.00  0.00  0.00  0.00  0.00  1.00
C22      18     0.00  0.00  0.00  0.00  0.00  0.00  0.00  1.00
C23      18     0.00  0.00  0.00  0.00  0.00  0.00  0.00  1.00
C24      18     0.00  0.00  0.00  0.00  0.00  0.00  0.00  1.00
C25      18     0.00  0.00  0.00  0.00  0.00  0.00  0.00  1.00
C26      19     0.00  0.00  0.00  0.00  0.00  0.00  0.00  1.00
C27      19     0.00  0.00  0.00  0.00  0.00  0.00  0.00  1.00
C28      19     0.00  0.00  0.00  0.00  0.00  0.00  0.00  1.00
C29      19     0.00  0.00  0.00  0.00  0.00  0.00  0.00  1.00
C30      19     0.00  0.00  0.00  0.00  0.00  0.00  0.00  1.00

TABLE 26
                                                  Cα                    Cβ
Type of motif                        Motif size   x      y      z      x      y      z
β-turns                              4-residues   1.9    45.1   23.2   2.0    45.8   22.0
                                                  4.4    43.0   25.2   4.3    42.9   26.7
                                                  3.5    39.7   23.5   2.3    39.0   23.9
                                                  3.8    41.0   19.9   3.4    42.3   19.4
Loops                                4-residues   35.6   40.1   67.8   35.3   40.3   69.0
                                                  33.3   40.2   64.9   32.5   41.5   64.6
                                                  34.6   38.2   61.9   35.1   36.9   62.2
                                                  31.6   38.5   59.7   30.4   38.3   59.9
Mostly single α-helix surface        4-residues   49.4   26.0   67.5   50.3   25.6   68.7
                                                  46.5   23.9   66.1   45.5   23.8   67.2
                                                  50.4   21.1   68.6   51.5   20.0   68.7
                                                  47.0   20.9   70.2   47.2   20.9   71.7
                                     5-residues   9.8    2.9    30.7   10.4   2.3    31.8
                                                  6.1    2.0    30.7   5.1    2.4    31.7
                                                  8.4    2.5    25.8   9.8    2.5    26.2
                                                  7.1    −1.0   26.7   8.9    −1.5   28.1
                                                  7.1    −1.9   21.6   8.2    −2.4   22.5
                                     6-residues   14.2   11.6   49.2   13.9   11.1   50.6
                                                  14.1   10.7   44.2   15.1   10.1   45.2
                                                  10.9   8.9    44.7   10.0   8.9    46.0
                                                  11.6   6.8    40.2   11.9   6.0    41.5
                                                  7.9    6.8    39.8   6.9    7.5    40.7
                                                  9.5    4.5    35.6   9.8    4.0    36.6
                                     7-residues   23.8   10.2   60.4   22.4   10.7   60.1
                                                  23.3   6.7    61.9   22.8   5.5    61.1
                                                  24.6   8.8    66.3   23.1   8.6    66.0
                                                  26.1   5.4    67.2   26.3   4.3    66.2
                                                  26.0   6.8    72.0   25.0   5.9    71.4
                                                  28.4   8.5    76.3   27.0   9.0    75.9
                                                  28.2   4.9    77.4   27.8   3.8    76.6
Mostly non-single α-helix surface    4-residues   80.5   4.4    21.7   81.1   4.4    20.7
                                                  78.2   7.2    22.7   78.1   7.3    24.1
                                                  73.6   11.8   21.6   74.0   12.2   23.0
                                                  70.4   13.1   20.3   70.0   12.6   19.3
                                     5-residues   14.8   −1.5   24.2   13.5   −0.7   24.1
                                                  15.5   −2.5   20.6   15.9   −1.6   19.4
                                                  18.6   −4.1   21.7   20.0   −3.4   22.3
                                                  15.1   −13.2  19.9   14.9   −12.3  18.8
                                                  15.8   −18.6  20.1   15.4   −17.9  19.4
                                     6-residues   5.1    −10.3  1.7    6.2    −10.1  0.9
                                                  3.2    −7.1   2.3    3.6    −6.3   3.5
                                                  2.2    −4.8   −0.6   3.0    −3.7   −0.5
                                                  4.0    −1.6   −1.6   4.2    −1.5   −3.0
                                                  3.2    1.8    −0.1   4.3    2.6    0.4
                                                  0.5    3.9    −1.7   1.1    5.1    −2.3
                                     7-residues   −2.9   30.2   −0.9   −3.8   31.4   −1.2
                                                  −4.9   33.4   −0.3   −5.3   33.7   1.1
                                                  −4.1   36.7   −2.1   −5.3   37.5   −2.7
                                                  −2.5   42.3   1.4    −1.8   42.6   2.7
                                                  −3.0   45.6   −0.4   −4.2   46.5   −0.1
                                                  0.1    48.1   −0.9   −0.1   49.2   0.0
                                                  0.7    50.9   1.0    2.1    51.1   1.4
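By way of illustration only, the coordinates of Table 26 can be turned into Cα-Cβ vectors and a distance matrix along the following lines. The sketch uses the first motif listed in Table 26 (the 4-residue turn motif). The choice to build the matrix from all pairwise distances between the eight Cα and Cβ positions is an assumption made for this example, since the table itself does not fix a particular matrix layout; the names TURN_MOTIF, ca_cb_vectors and distance_matrix are hypothetical.

```python
import math

# (Cα, Cβ) coordinate pairs of the first 4-residue motif in Table 26.
TURN_MOTIF = [
    ((1.9, 45.1, 23.2), (2.0, 45.8, 22.0)),
    ((4.4, 43.0, 25.2), (4.3, 42.9, 26.7)),
    ((3.5, 39.7, 23.5), (2.3, 39.0, 23.9)),
    ((3.8, 41.0, 19.9), (3.4, 42.3, 19.4)),
]


def ca_cb_vectors(motif):
    """Return the Cα-to-Cβ displacement vector of each residue."""
    return [tuple(b - a for a, b in zip(ca, cb)) for ca, cb in motif]


def distance_matrix(motif):
    """Pairwise distances between all Cα and Cβ positions.

    Rows and columns are ordered Cα1..Cαn, Cβ1..Cβn (an assumed layout).
    The matrix does not change when the motif is rotated or translated.
    """
    points = [ca for ca, _ in motif] + [cb for _, cb in motif]
    return [[math.dist(p, q) for q in points] for p in points]


vectors = ca_cb_vectors(TURN_MOTIF)
matrix = distance_matrix(TURN_MOTIF)

# Length of each Cα-Cβ vector, in the units of Table 26.
lengths = [math.hypot(*v) for v in vectors]
print([round(length, 2) for length in lengths])
```

Because such a matrix depends only on interatomic distances, two motifs can be compared without first superimposing them; a representation of this kind cannot, however, distinguish a motif from its mirror image.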

1. A computer-implemented method of producing a description of a common three-dimensional protein surface shape including the steps of: (i) identifying a three-dimensional surface shape of each of a plurality of proteins; and (ii) creating one or more descriptors wherein each said descriptor represents a common surface shape derived from respective said three-dimensional surface shapes of two or more proteins of said plurality of proteins.
 2. The method of claim 1, wherein each three-dimensional surface shape identified at step (i) is represented by a side-chain location and orientation of two or more amino acids of each surface.
 3. The method of claim 2, wherein at step (ii) each said descriptor represents a common location and orientation of respective side chains of two or more amino acids of each of said two or more proteins.
 4. The method of claim 3, wherein the or each said descriptor represents a common location and orientation of respective side chains of three or more amino acids of each of said two or more proteins.
 5. The method of claim 3, wherein the or each said descriptor represents the common location and orientation of respective side chains of two or more amino acids of each of three or more proteins.
 6. The method of claim 2 or claim 3, wherein the location and orientation of each said amino acid side chain is in three-dimensional (3D) space.
 7. The method of claim 6, wherein each amino acid side chain used to produce said descriptor is simplified as a Cα-Cβ vector.
 8. The method of claim 1, wherein each three-dimensional surface shape identified at step (i) is represented as a charged surface region of each said protein.
 9. The method of claim 8, wherein at step (ii) each said descriptor represents a common charged surface region of two or more proteins of said plurality of proteins.
 10. The method of claim 9, wherein each charged surface region is represented by at least four grid points.
 11. The method of claim 10, wherein respective said grid points are 0.2 to 2.0 angstrom apart in three dimensional (3D) space.
 12. The method of claim 11, wherein respective said grid points are 0.5-1.5 angstrom apart in three dimensional (3D) space.
 13. The method of claim 9, wherein the or each said descriptor represents the common charged protein surface shape of three or more proteins.
 14. The method of claim 1, wherein said three-dimensional surface shape is of at least part of a structural feature of each of said two or more proteins.
 15. The method of claim 14, wherein said structural feature is, or comprises, a β-turn, a loop or a contact surface.
 16. The method of claim 14, wherein said structural feature is, or comprises, a loop or a contact surface.
 17. The method of claim 16, wherein the contact surface comprises one or more discontinuous and/or continuous surfaces.
 18. The method of claim 3, wherein said descriptor represents side-chain location and orientation of four β-turn or loop amino acids.
 19. The method of claim 3, wherein said descriptor represents side-chain location and orientation of at least three amino acids of a contact surface.
 20. The method of claim 19, wherein said descriptor represents side-chain location and orientation of four, five, six or seven amino acid side-chains of a contact surface.
 21. A computer-implemented method of identifying one or more molecules having a common three-dimensional protein surface shape, said method including the steps of: (i) creating a query using one or more descriptors that each represent a common three-dimensional protein surface shape; and (ii) using said query to search a database and thereby identify one or more entries in said database that correspond to one or more molecules that each match said descriptor.
 22. The method of claim 21, wherein at step (i), the descriptor represents a common amino acid side-chain location and orientation of two or more amino acids of each of two or more proteins.
 23. The method of claim 21, wherein at step (i), the descriptor represents a common protein surface charge shape of two or more proteins.
 24. The method of claim 21, wherein at step (i), the query comprises: (a) a descriptor that represents a common side chain location and orientation of two or more amino acids of each of two or more proteins; and (b) a descriptor that represents a common protein surface charge shape of said two or more proteins.
 25. The method of claim 21, wherein each amino acid side chain used to produce said descriptor is simplified as a Cα-Cβ vector.
 26. The method of claim 25, wherein the Cα-Cβ vectors of said query are represented as a distance matrix.
 27. A computer-implemented method of creating a library of molecules including the steps of: (i) searching a database to identify one or more entries corresponding to one or more molecules that each match a common protein surface shape; and (ii) using at least one of the one or more molecules identified at step (i) to create a library of molecules.
 28. The method of claim 27, wherein said library of molecules is a virtual library.
 29. The method of claim 27, wherein said library of molecules is a synthetic chemical library.
 30. A method of engineering one or more molecules including the steps of: (i) creating one or more descriptors that each represent a common three-dimensional protein surface shape; and (ii) engineering one or more molecules that respectively comprise one or more structural features according to the or each descriptor in (i).
 31. The method of claim 30, wherein the or each descriptor represents a common side chain location and orientation of two or more amino acids of each of two or more proteins.
 32. The method of claim 30, wherein the or each descriptor represents a common protein surface charge shape of said two or more proteins.